Caching directories in Concourse CI Pipelines

By Johannes Rudolph, 25 May 2017

meshcloud uses Concourse to power our continuous integration pipelines and provide continuous delivery of all components of our cloud stack, from OpenStack to Cloud Foundry deployments and the microservices making up our federation layer.

Concourse's biggest strength is that all build steps consume versioned inputs and run in isolated containers with well-defined inputs and outputs. This makes it easy to achieve reproducible builds and deployments. However, the downside of this approach is that build steps which download third-party libraries from a package manager such as npm/yarn, Maven or NuGet can't cache downloaded packages between runs of a task, because each new instance of a task starts in a clean container. For many of our microservices, this means downloading hundreds of MB of libraries and unpacking them, which often takes much longer than the actual build itself.

There's a long-standing GitHub issue for Concourse that discusses various ways to improve this; however, none of them has landed in Concourse yet, and there's no clear timeline for when this will be fixed.

Problem statement

We thus need a strategy to cache downloaded packages between builds. In general, the solution should look like this:

  • fetch git repo
  • find last parent commit id that changed the package file (e.g. package.json, pom.xml)
  • check cache for package tarball
  • if package tarball has not been built yet, invoke package manager to fetch packages and generate tarball
  • download and extract package tarball
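The steps above can be sketched as a single shell function. Everything here is illustrative, not part of the sample repo: the function name, the `CACHE_DIR`/`CACHE_OUT` variables, and the use of a local directory as the tarball store.

```shell
#!/bin/sh
# Illustrative sketch of the caching strategy; all names are assumptions.
# CACHE_DIR: where tarballs are stored; CACHE_OUT: dir the build command fills.
# Usage: cache_fetch <package-file> <build-command>
cache_fetch() {
  pkg_file=$1
  build_cmd=$2
  # 1. find the last parent commit that changed the package file
  git_ref=$(git log -1 --format=%H -- "$pkg_file")
  tarball="$CACHE_DIR/cache-$git_ref.tar.gz"
  mkdir -p "$CACHE_DIR"
  # 2. check the cache for an existing package tarball
  if [ ! -f "$tarball" ]; then
    # 3. cache miss: let the package manager fetch everything, then snapshot it
    sh -c "$build_cmd"
    tar -czf "$tarball" "$CACHE_OUT"
  fi
  # 4. extract the (possibly freshly built) package tarball
  tar -xzf "$tarball"
}
```

For an npm project you'd call something like `cache_fetch package.json "npm install"` with `CACHE_OUT=node_modules`.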

Package Caches as Resources

At first glance, this sounds an awful lot like a perfect fit for a Concourse Resource:

A resource is any entity that can be checked for new versions, pulled down at a specific version, and/or pushed up to idempotently create new versions.

In fact, there’s a variety of Resources such as the npm-cache-resource for the node.js package manager. The general pattern among these resources is that they are based on the git resource, use a paths pattern to identify package files and then build a versioned package tarball as part of their get operation. Concourse then caches the result of the get operation, which avoids unnecessarily rebuilding the cache for each build.
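To make this concrete, wiring up the npm-cache-resource looks roughly like this (a sketch based on the resource's README; the repository URI and resource names are placeholders):

```yaml
resource_types:
- name: npm-cache
  type: docker-image
  source:
    repository: ymedlop/npm-cache-resource
    tag: latest

resources:
- name: repo
  type: git
  source: &repo-source   # shared via a YAML anchor...
    uri: https://github.com/acme/app
    branch: master
- name: npm-cache
  type: npm-cache
  source:
    <<: *repo-source     # ...so the cache tracks the same repo and branch
    paths:
    - package.json
```

Note the YAML anchor needed to tie the cache resource to the same repo and branch — this is exactly the kind of boilerplate discussed below.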

This was the first approach we tried at meshcloud; however, we quickly discovered that it has a few drawbacks:

  • Build pipelines require a significant amount of boilerplate gymnastics to try to tie a git repo to its caches (examples)
  • Even with those gymnastics in place, it’s not guaranteed that Concourse will check the git repo and the Cache Resource (based on the same repo) at the same time. This can potentially lead to inconsistent builds.
  • We need a separate Resource (including its own image build etc.) for each type of package manager that we use. For the package managers we use, that means implementing a bunch of new Resources.
  • We'd need to reimplement each Cache Resource on top of git-branch-heads, which we use instead of the standard git resource to build changes from each feature branch in our repo. However, git-branch-heads does not support paths patterns.

So we needed something "better" to fit our requirements.

Package Caches as Tasks

The next logical step was then to implement caching using a Concourse Task:

A task can be thought of as a function from inputs to outputs that can either succeed or fail.

This approach allows us to easily address all drawbacks of the Resource-based approach, at the expense of hiding the Resource used to store the package tarballs from Concourse. This is not a big deal though, as the package tarball store is an implementation detail of the cache task; all that matters to fellow tasks in the pipeline are the downloaded packages. In our case, we chose to simply curl the package tarballs to an HTTP file server.
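A sketch of what "curl the tarballs to an HTTP file server" can look like, assuming a server that answers plain GET and accepts PUT uploads (e.g. nginx with `dav_methods PUT`); the function name and the `BASE_URL`/`CACHE_OUT` variables are illustrative, not the sample repo's actual scripts:

```shell
#!/bin/sh
# Illustrative store interaction; BASE_URL points at the HTTP file server.
# Usage: fetch_or_build_cache <git-ref> <build-command>
fetch_or_build_cache() {
  ref=$1
  build_cmd=$2
  tarball="cache-$ref.tar.gz"
  # -f makes curl fail on a 404, i.e. on a cache miss
  if ! curl -sfO "$BASE_URL/$tarball"; then
    sh -c "$build_cmd"                            # build the packages
    tar -czf "$tarball" "$CACHE_OUT"              # snapshot them
    curl -sf -T "$tarball" "$BASE_URL/$tarball"   # upload for future builds
  fi
  tar -xzf "$tarball"                             # unpack into the workspace
}
```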

I have set up a sample concourse-cached-pipeline repo on GitHub that demonstrates our approach. The pipeline just contains a single git resource for the repo.

The build job fetches the repo resource, and then executes the cache-a and cache-b tasks in parallel to simulate fetching from two different package managers.

- name: build-master
  plan:
  - get: repo
    resource: github-master
    trigger: true
  - aggregate:
    - task: cache-a
      file: repo/ci/cache-a.yml
      params:
        BASE_URL: {{artifacts-url}}
        CACHE_PATH: concourse-cached-pipeline/caches/build/a
      output_mapping:
        cache: cache-a
    - task: cache-b
      file: repo/ci/cache-b.yml
      params:
        BASE_URL: {{artifacts-url}}
        CACHE_PATH: concourse-cached-pipeline/caches/build/b
      output_mapping:
        cache: cache-b
  - task: build
    file: repo/ci/build.yml
The job uses output_mapping to map the downloaded and extracted cache tarballs to different input directories for the build.yml task. A run of this job looks like this in the Concourse Web UI:
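For context, the build.yml task config can then simply declare the mapped caches as inputs. This is a sketch of what such a file might contain — the image and the listing command are assumptions, not the sample repo's actual file:

```yaml
platform: linux

image_resource:
  type: docker-image
  source:
    repository: alpine

inputs:
- name: repo
- name: cache-a     # filled by cache-a's output_mapping
- name: cache-b     # filled by cache-b's output_mapping

run:
  path: sh
  args:
  - -exc
  - ls cache-a cache-b   # the extracted caches are plain directories here
```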

The run section of each cache task (e.g. cache-a.yml) looks like this:

run:
  path: sh
  dir: repo
  args:
  - -exc
  - |
    GIT_REF=`ci/lastref **a.version`
    ci/buildcache $GIT_REF "touch ../cache/a.txt"

We first invoke the lastref script to find the last parent commit in which a package-manager-specific file changed; for yarn, for example, you'd check for **yarn.lock. Next we invoke the buildcache script, passing it the git ref we found and a command it can use to build the cache should it not exist yet (for yarn, you'd probably run a yarn install).
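The core of lastref boils down to a single git invocation — a sketch of the idea, not necessarily the sample repo's exact script:

```shell
#!/bin/sh
# Print the newest commit on the current history that touched any file
# matching the given pathspec(s), e.g.: lastref '**yarn.lock'
lastref() {
  git log -1 --format=%H -- "$@"
}
```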

The lastref and buildcache scripts are very lightweight, and the resulting build pipeline is flexible and easy to reason about. The scripts should run in any standard docker image with a proper bash and curl, so you can easily incorporate them into your existing CI pipelines. Running a high number of builds per day with this approach, we have achieved great performance improvements, and we'd be glad to see it work well for your pipelines too.