Cloud Tagging and Labeling on Azure, AWS and GCP - (Cheat Sheet 2022)

Are you looking for Azure tag requirements or AWS tagging documentation, or do you want to know how to use GCP labels? You have come to the right place!

In this post, we want to give you an overview of how different cloud platforms handle tags or labels.

You can see this as a cheat sheet to help you navigate the cloud tagging or labeling specifics of Azure, AWS, and GCP.

This post will go into detail on questions like:

  • "How many characters can a tag have in Azure?",
  • "How many tags can be assigned to one resource in GCP?" or
  • "What characters are not supported for tags in AWS?".

So let's dive right in!

Using Tags in Azure

Here's what you need to know to get started with tagging in Azure:

  • A resource does not inherit tags hierarchically from the respective resource group
  • You can assign up to 50 tags to a single resource. If you need more, there is a workaround: store multiple values in a single tag, for example as a JSON string in the tag value.
  • The maximum length of a tag key in Azure is 512 characters. For values it's 256 characters.
  • Tag keys in Azure are not case-sensitive, but tag values are
  • Tag keys must not contain the characters: < > % & / ?
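
The workaround for the 50-tag limit can be sketched in a few lines of Python: several logical values are packed into one tag whose value is a JSON string. The tag name CostInfo and the helper pack_tag_value are made up for this illustration; only the 256-character value limit comes from the list above.

```python
import json

AZURE_MAX_VALUE_LENGTH = 256  # value limit from the list above


def pack_tag_value(values: dict) -> str:
    """Serialize several logical values into one tag value (hypothetical helper)."""
    packed = json.dumps(values, separators=(",", ":"), sort_keys=True)
    if len(packed) > AZURE_MAX_VALUE_LENGTH:
        raise ValueError(
            f"packed value has {len(packed)} characters, limit is {AZURE_MAX_VALUE_LENGTH}"
        )
    return packed


# One tag instead of three separate ones
tags = {
    "CostInfo": pack_tag_value({"costCenter": "1234", "owner": "team-a", "env": "dev"})
}
```

The trade-off is that tools which filter on plain tag values need to parse the JSON first.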

How tags work with AWS

These are the AWS tagging specifications you need to follow:

  • You can assign 50 tags per resource
  • Tag keys must be unique for each resource and can only have one value
  • The character limit for keys is 128, for values it's 256
  • You can use the allowed characters across all AWS services: letters, numbers, and spaces representable in UTF-8, and the following characters: + - = . _ : / @
  • EC2 allows for any character in its tags
  • Tag keys and values are case-sensitive
  • The aws: prefix is reserved for AWS use. If a tag has a tag key with this prefix, then you can't edit or delete the tag's key or value. Tags with the aws: prefix do not count against your tags per resource limit.

Tagging with GCP

First: Google calls its tags in GCP "labels" - they are still tags though.

Let's see what requirements and restrictions GCP labels have:

  • A resource can have up to 64 labels assigned to it
  • Both keys and values have a maximum length of 63 characters
  • Keys and values can contain: lowercase letters, numeric characters, underscores and hyphens
  • You are able to use international characters
  • Label keys must start with a lowercase letter
  • Label keys cannot be empty

Cloud Tagging At a Glance

Constraint / Platform | Azure | AWS | GCP
Max. number of tags per resource | 50 | 50 | 64
Max. tag key length (characters) | 512 | 128 | 63
Max. tag value length (characters) | 256 | 256 | 63
Case-sensitive? | keys: no, values: yes | yes | yes
Allowed characters | everything except < > % & / ? (in keys) | letters, numbers, spaces, and + - = . _ : / @ | lowercase letters, numbers, underscores, hyphens
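
If you want to check tags before sending them to a platform, the limits from this cheat sheet can be turned into a small validator. The following Python sketch is only an approximation under a few assumptions: the function and constant names are ours, the character checks are simplified (for example, the EC2 exception is ignored), and the platforms themselves remain the source of truth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRules:
    max_tags: int
    max_key_len: int
    max_value_len: int


# Limits taken from the cheat sheet above
RULES = {
    "azure": TagRules(50, 512, 256),
    "aws": TagRules(50, 128, 256),
    "gcp": TagRules(64, 63, 63),
}
AZURE_FORBIDDEN_KEY_CHARS = set("<>%&/?")
AWS_EXTRA_CHARS = set("+-=._:/@ ")
GCP_EXTRA_CHARS = set("_-")


def _aws_ok(text: str) -> bool:
    return all(c.isalnum() or c in AWS_EXTRA_CHARS for c in text)


def _gcp_ok(text: str) -> bool:
    # letters must not be uppercase; digits, underscores and hyphens are fine
    return all(
        (c.isalpha() and not c.isupper()) or c.isdigit() or c in GCP_EXTRA_CHARS
        for c in text
    )


def check_tags(platform: str, tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    rules = RULES[platform]
    problems = []
    if len(tags) > rules.max_tags:
        problems.append(f"too many tags: {len(tags)} > {rules.max_tags}")
    for key, value in tags.items():
        if not key:
            problems.append("empty key")
        if len(key) > rules.max_key_len:
            problems.append(f"key too long: {key!r}")
        if len(value) > rules.max_value_len:
            problems.append(f"value too long for key {key!r}")
        if platform == "azure" and AZURE_FORBIDDEN_KEY_CHARS & set(key):
            problems.append(f"forbidden character in key {key!r}")
        if platform == "aws":
            if key.startswith("aws:"):
                problems.append(f"reserved prefix in key {key!r}")
            if not (_aws_ok(key) and _aws_ok(value)):
                problems.append(f"unsupported character in {key!r}")
        if platform == "gcp":
            if not key[:1].islower():
                problems.append(f"key must start with a lowercase letter: {key!r}")
            if not (_gcp_ok(key) and _gcp_ok(value)):
                problems.append(f"unsupported character in {key!r}")
    return problems
```

Calling check_tags("gcp", {"Env": "dev"}) reports a problem because of the uppercase letter, while check_tags("aws", {"cost-center": "1234"}) returns an empty list.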

meshcloud offers global management of tags in multi-cloud architectures and makes sure they comply with the specific platform requirements.

To learn more about the meshcloud platform, please get in touch with our sales team or book a demo with one of our product experts. We're looking forward to hearing from you.

Log4Shell: meshcloud NOT affected by Log4J Vulnerability

Researchers have found a critical zero day vulnerability in the Apache Log4J library. Our solution meshStack is NOT affected. Our engineers checked the meshStack services and dependencies and confirmed that our solution does not include affected Log4J modules.

What is the Problem with Log4J?

Apache Log4J is a widely used library for logging errors.

The recently discovered vulnerability CVE-2021-44228 - dubbed Log4Shell or LogJam - was given a CVSS severity level of 10 out of 10. It seems to be easy to exploit and enables attackers to execute code remotely.

The German Federal Office for Information Security (BSI) has issued its highest warning level, "4/Red", for the Log4J vulnerability. It goes on to say that "the extent of the threat situation cannot currently be conclusively determined." The BSI also warns that the vulnerability is already being actively exploited, for example for cryptomining or other malware.

When are you affected by Log4Shell?

First, the vulnerability is located in the log4j-core module. Other libraries like Spring Boot only use log4j-api by default and are therefore not affected by this vulnerability (see Spring Boot's blog post on it).

Second, the vulnerability can only be exploited if your application logs input that was provided by users in any way (e.g. via an API or via a UI). Attackers could provide crafted messages that make Log4J load and execute remote code, which can be used to access your server and run malware on it. As long as neither you nor the libraries you are using log any input given by users, you should not be affected by the vulnerability. But especially when other libraries are involved, it can be hard to judge whether you are actually affected. So if you are using Log4J for logging, you should follow the recommendations below, regardless of whether you think that messages provided by users are logged.

Current Recommendations

The Apache Foundation recommends updating the library to version 2.15.0 or, if that is not possible, following the instructions on their Log4J security vulnerabilities page to mitigate the risk.
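
To get a first impression of whether an application ships an outdated log4j-core, you can scan its directories for the jar file. The Python sketch below is a rough helper under strong assumptions: it only looks at file names, so it will miss jars that are renamed or bundled inside other archives, and the function names are ours. The version threshold is the one from the recommendation above, so please check the Log4J security page for the currently recommended version.

```python
import re
from pathlib import Path

# Matches file names such as log4j-core-2.14.1.jar or log4j-core-2.0-beta9.jar
JAR_PATTERN = re.compile(r"^log4j-core-(\d+)\.(\d+)(?:\.(\d+))?.*\.jar$")
FIXED_VERSION = (2, 15, 0)  # version named in the recommendation above


def parse_version(filename: str):
    """Return (major, minor, patch) for a log4j-core jar name, else None."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_outdated(filename: str) -> bool:
    version = parse_version(filename)
    return version is not None and version[0] == 2 and version < FIXED_VERSION


def find_outdated_jars(root: str) -> list:
    """Recursively list log4j-core jars below `root` that are older than the fix."""
    return sorted(p for p in Path(root).rglob("log4j-core-*.jar") if is_outdated(p.name))
```

A call like find_outdated_jars("/opt/my-app") (the path is just an example) returns the suspicious files. An empty result is not proof that you are safe.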

Open Source Cloud Cost Dashboard in under 10 Minutes

Cloud cost dashboard built in Google Data Studio displaying cloud billing information in tables and charts. You can build a cloud cost dashboard like this in under 10 minutes.

Three easy steps to get to your cloud cost dashboard

Everyone loves good dashboards. If done well, cloud cost dashboards can give you all the vital information that you need in a single overview. This is incredibly important when managing your cloud costs.

In this blog post, we will show you how you can build a cloud cost dashboard. See your cloud expenditure at a glance in AWS, Microsoft Azure & (soon to be) Google Cloud. The best part: it is completely free and doesn’t take longer than 10 minutes. This is thanks to our open-source multi-cloud CLI Collie which we recently launched on GitHub. To get to your cloud cost dashboard, we will follow along with these steps:

  • Step 1: Preparing cloud cost data with metadata using tags
  • Step 2: Extracting the data from the clouds
  • Step 3: Building a cloud cost dashboard

At the end of this guide, you will have a cloud cost dashboard that looks something like this.

To follow along with this blog post, you need a license for Google Data Studio. It should be included for free in G-Suite. You could also use another dashboard tool such as Microsoft PowerBI, as long as it supports importing data from CSV files.

Step 1: Preparing cloud cost data with metadata using tags

Your cloud cost dashboard is only as good as the data it is built upon. That’s why the first and most important step is to prepare the necessary cost data with the right metadata. Without proper metadata, it is going to be difficult to filter and view information from certain angles. One vital part of building good cost data from the public cloud providers is the use of tags. The more tags you use, the more questions you can answer for yourself or your management:

  • Which team (or department) is spending the most in the cloud?
  • Which cost center is spending the most in the cloud?
  • Which cloud platform has the highest usage?
  • How much are we spending on development stages?
  • How is the expenditure of cloud-native projects vs. lift-and-shift projects?
  • Whom do I need to contact for more information about this project?

Step 2: Extracting the data from the clouds

Once you are happy with the metadata you applied to your projects, it is time to export the cost data. This allows it to be imported into your cloud cost dashboard later on. To make the export as easy as possible, we will use our recently launched Collie CLI. Collie can export all cost data with one single command in a CSV file. We will then prepare this CSV file using Google Sheets so it can be used as a data source for the dashboard.

To do this export with Collie we need to execute the following steps:

  1. Before installing: make sure you have properly set up the CLIs of the cloud platforms.
  2. Install Collie as explained here.
  3. Run the cost export command for a given time interval. Make sure to use whole months as the cost data is on a monthly basis. The following command would work for Q1 & Q2 of 2021:

collie tenant costs --from 2021-01-01 --to 2021-06-30 -o csv > q1_q2_2021_export.csv

  4. You should now have a CSV export with the cost data of your cloud(s). The metadata tags are provided as extra columns, which is important for when we build your cloud cost dashboard.
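
Before importing the export into a dashboard, it can be useful to sanity-check it with a few lines of Python. The sketch below sums costs per value of one column. The column names in the sample (platform, from, cost, tag_owner) are assumptions for illustration, so please check the header row of your own export and adjust them.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def total_by(csv_text: str, group_column: str, cost_column: str) -> dict:
    """Sum the cost column per distinct value of the group column."""
    totals = defaultdict(Decimal)  # Decimal() starts at 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[group_column]] += Decimal(row[cost_column])
    return dict(totals)


# Made-up sample data; the real export may use different column names
SAMPLE = (
    "platform,from,cost,tag_owner\n"
    "AWS,2021-01-01,10.00,team-a\n"
    "Azure,2021-01-01,5.50,team-a\n"
    "AWS,2021-02-01,7.25,team-b\n"
)

per_owner = total_by(SAMPLE, "tag_owner", "cost")
per_platform = total_by(SAMPLE, "platform", "cost")
```

To run it against the real file, pass the file content instead of SAMPLE, e.g. total_by(open("q1_q2_2021_export.csv").read(), ...).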

Next up, we will import the CSV data into a Google Sheets spreadsheet. Follow these steps:

  1. Create a new Google Sheet (hint: navigate to
  2. In the menu at the top, open "File" and click "Import". Navigate to the "Upload" tab in the dialog and upload the CSV file from before.
  3. Make sure to untick the checkbox that says "Convert text to numbers, dates, and formulas". Confirm the import by clicking "Import Data".
  4. Open the new spreadsheet by clicking "Open now". Make sure to name this new spreadsheet something that you can remember later.

That’s it! We now have a well-prepared spreadsheet that we will use to power your new cloud cost dashboard.

Step 3: Building your cloud cost dashboard!

Okay, the data is ready! We can start building your cloud cost dashboard now. To make things easier, we have already prepared a template for you. You can find it here. Let’s link it to the data source that we set up before. Follow along the next steps to do so:

  1. Open the dashboard link mentioned above and make a copy of the dashboard. To do so, click the settings icon and click "Make a copy" as shown in the screenshot below.

Google Data Studio screenshot showing how to make a copy.

  2. Data Studio will now ask you to enter a new data source. We will use the Google Sheet from before. To do so, click the dropdown below "New Datasource" and click "Create new data source".
  3. Select the connector called "Google Spreadsheets" and a new window should pop up that allows you to search for Google Sheets in your Google Drive. Try to find the one you created before and confirm the connection by clicking the blue "Connect" button at the top right.
  4. Before creating this new data source, we need to clean up our data types a tiny bit. In the menu that just opened in front of you, scroll to the "from" dimension and change its type from "Date & Time" to "Date & Time → Year Month".
  5. Confirm the creation of the data source. Click the "Add to Report" button at the top right.
  6. Finally, copy over your new report by clicking "Copy report".

That’s it! You’re looking at your new cloud cost dashboard. The cloud cost dashboard template we created offers various features:

  • Viewing total costs on a monthly basis
  • Viewing costs per cloud platform
  • Filtering on dates
  • Viewing costs across various tags, in our case:
      • Owner (who owns this cloud account)
      • Cost Center (a way of allocating costs to budgets)
      • Environments (development, production, etc.)
      • Zone (is it a cloud-native or lift-and-shift project)
      • Departments (which unit owns the project)
  • Viewing tenants (and their metadata) with the highest cost.

There is a good chance that you have different metadata than what we used in the dashboard template, which might break one or more charts. We recommend tweaking your cloud cost dashboard to your needs, but we hope that this template gets you off to a great start 🚀

Get Started with your own Cloud Cost Dashboard!

Now that you have reached the end of this post, you see that it doesn’t have to be rocket science to build a powerful cloud cost dashboard. Leverage the power of our open-source Collie CLI and you have the necessary cost data extracted within minutes. Duplicate our Google Data Studio Dashboard, connect it to the CSV export and you’re done!

What questions are you going to answer with your new cloud cost dashboard? Let us know in the comments below!

What’s next?

Collecting costs isn’t the only thing that Collie can do 😉 Curious what else Collie CLI can do for you? Head over to our meshcloud GitHub page and find out!


Want to move your cloud financials to the next level? Head over to our cost management solution and learn how we can help you!

How to Implement Declarative Deletion

At meshcloud we've implemented a declarative API and the biggest challenge has been the declarative deletion of objects.

That's why in this blog post I want to answer the question:

How do I implement deletion for a declarative API?

Ahead I cover the challenges we ran into, how other systems solve it and which solution we applied at meshcloud.

Head over to my blog post about implementing a declarative API if you want to start at the beginning of this two-part endeavour.

Let's start by having a look at the advantages of declarative deletion:

Why handle deletion declaratively?

If your use-case fits a declarative API you should also think about the deletion of objects.

If an external system syncs objects into your system and also ensures that objects are deleted when they are no longer present in the primary system, a declarative API simplifies client code a lot.

If deletion could only be executed in an imperative way, the client would have to find out which objects actually have to be deleted in the target system to call the delete endpoint for those objects accordingly.

Let's have a look at the group synchronization process:

sync of groups from the IdP to your system

As you can see, the client somehow needs to build a diff between the desired state and the actual state in the target system to determine which objects have to be deleted. Moving this complex logic to the server side means you only have to implement it once instead of n times for different clients.

Additionally, it can provide a big performance improvement, as quite a lot of data might be needed to do the diff. If that data only needs to be handled by the backend, you avoid the network latency and bandwidth limitations of getting that data to the client.

If you build the diff on the client, you may also struggle with outdated data: getting the current state and processing it takes some time, during which the state may already have changed in the target system.

How to implement deletion for a declarative API

The central conceptual question for deletion is

"How can you identify which objects need to be deleted?".

As not only one client will be using your API, a central aspect is to group objects into a Declarative Set. If one item in this set is missing, it will be deleted.

But items from another set are untouched. To understand how to solve declarative deletion, let's have a look at actual implementations of it that are used in production.

Another implementation challenge is an efficient algorithm for comparing the actual state with the desired state. This topic isn't covered in this blog post. Aspects like reducing the number of DB queries, doing bulk queries for the objects that are part of the Declarative Set, as well as bulk deletions, should be considered.
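
To make the idea of a Declarative Set more tangible, here is a minimal Python sketch of the comparison step. It is an illustration only, not how any of the systems below implement it: the names StoredObject and plan_changes are made up, specs are plain strings, and everything lives in memory instead of a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    id: str
    declarative_set: str  # the set this object was applied with
    spec: str


def plan_changes(stored: list, declarative_set: str, desired: dict):
    """Compare desired state (id -> spec) with the stored objects of ONE set.

    Returns the ids to create, update and delete. Objects that belong to
    another Declarative Set are never considered for deletion.
    """
    in_set = {o.id: o for o in stored if o.declarative_set == declarative_set}
    create = sorted(desired.keys() - in_set.keys())
    delete = sorted(in_set.keys() - desired.keys())
    update = sorted(
        i for i in desired.keys() & in_set.keys() if in_set[i].spec != desired[i]
    )
    return create, update, delete


stored = [
    StoredObject("a", "set-1", "v1"),
    StoredObject("b", "set-1", "v1"),
    StoredObject("c", "set-2", "v1"),  # different set, must stay untouched
]
create, update, delete = plan_changes(stored, "set-1", {"a": "v2", "d": "v1"})
```

Here "d" is created, "a" is updated and "b" is deleted, while "c" is left alone because it belongs to another set.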

How do existing declarative systems handle declarative deletion?


Terraform

In Terraform, you define a tf file that contains the desired state. When this state is applied via terraform, it writes a tfstate file locally or to a shared remote location. That way Terraform knows which state is actually expected in the target systems. It can create a diff between the actual state (tfstate) and what is expected to be the new state (tf). This requires an up-to-date state file. It deletes all resources that have been present in tfstate but are no longer present in tf.

The tf file is basically what I described as a Declarative Set before. So in case of a shared remote location that keeps track of the state, it is always related to the tf file and will only ever delete resources that are in the related tfstate.

Terraform also provides a terraform plan command, which will show you the actual changes Terraform would apply. That way you can verify whether the intended changes will be applied.

Additionally Terraform provides an imperative CLI command to delete specific resources or all resources in a tf file (-target can be defined optionally to delete specific resources).

terraform destroy -target RESOURCE_TYPE.NAME -target RESOURCE_TYPE2.NAME


Kubernetes

The recommended approach in Kubernetes is using the imperative kubectl delete command to delete individual objects. It is recommended because it is explicit about which objects will be deleted (either a specific object or the reference to a yaml file).

But Kubernetes also supports the declarative approach. You have to kubectl apply a complete configuration folder. Additionally you have to use the alpha option --prune that actually allows deleting all objects that are no longer present in the yaml files inside the folder (see Kubernetes docs). The different ways of object management in Kubernetes are described very well here.

The kubectl apply command also provides an option for a dry-run to verify whether the intended changes will be applied.

How Kubernetes actually handles declarative deletion is explained in great detail in their documentation.

Here's a brief summary:

Kubernetes saves the complete state on the server side. When doing a kubectl apply --prune you have to either provide a label with -l <label> or refer to all resources in the namespaces used in the folder's yaml files via --all. Both ways match what I described as a Declarative Set before. They make it possible for Kubernetes to know which objects actually belong to the set of resources that are part of the intended desired state. So when you apply again, e.g. with the same label, it will simply delete all objects with this label if they don't exist in the folder anymore. Using the label is the safer approach, as deleting every resource that is no longer present in all namespaces referenced in the yaml files is rather dangerous.

Kubernetes also uses an interesting approach to determine what needs to be updated in an object. It stores the configuration that was present in the yaml file in a last-applied-configuration annotation on the object. That way it will only delete attributes that were present in a previous application but are not anymore. It does not overwrite other attributes that have only been set via imperative commands. So it does an actual patch, based on what was applied before and what is currently present in the yaml file.

The determination of what changes need to be applied has traditionally been done in the kubectl CLI tool. But recently a server-side apply implementation has been released as GA.
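
The patch logic described above can be sketched in Python. This is a strongly simplified illustration and not the actual kubectl algorithm: it only handles flat dictionaries, whereas Kubernetes also merges nested fields and lists. The function name three_way_patch and the sample attributes are assumptions.

```python
def three_way_patch(last_applied: dict, live: dict, desired: dict) -> dict:
    """Compute the new object from three inputs.

    last_applied: what the previous apply contained
    live:         the object as it currently exists on the server
    desired:      what the current yaml file contains
    """
    result = dict(live)
    # Attributes that were applied before but are gone from the file are deleted
    for key in last_applied.keys() - desired.keys():
        result.pop(key, None)
    # Attributes in the file are created or overwritten
    result.update(desired)
    # Attributes that only exist on the live object (e.g. set by an
    # imperative command) were never touched and stay as they are
    return result


last_applied = {"replicas": 2, "image": "app:1"}
live = {"replicas": 2, "image": "app:1", "nodeName": "n1"}  # nodeName set imperatively
desired = {"image": "app:2"}

patched = three_way_patch(last_applied, live, desired)
```

In this example replicas is removed because it was applied before and is no longer in the file, image is updated, and nodeName survives because it was never part of an applied configuration.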

Sadly they didn't touch multi-object apply yet. It remains in the kubectl CLI. So actual declarative object deletion is not part of the server-side implementation.

AWS CloudFormation

AWS CloudFormation uses so-called Stacks to group the resources it creates. The Stack is what I called a Declarative Set before. You can update an existing Stack by applying a CloudFormation template to it. You can modify the template and re-apply it. This modification can also include removing resources. They will then be deleted in AWS when the template is applied to the Stack again.

How does meshStack handle it?

In meshStack we provide an API that takes a list of meshObjects and creates and updates them. If you want to apply declarative deletion you have to provide a meshObjectCollection. This is how we decided to implement the Declarative Set I mentioned before. This is similar to adding a label to kubectl apply or using a Stack in AWS. meshObjects no longer present in the request will be deleted if they had been applied before using the same meshObjectCollection parameter. The check in our backend is rather simple, as we just added a meshObjectCollection field to our entities and can query by it. That way we can do a diff between what is applied in the request and what already existed before.

When actually executing the import of meshObjects, the result contains information about whether meshStack was able to successfully reconcile the desiredState or which error occurred for a specific meshObject. A failed import of one meshObject won't break the complete process. Instead, the response contains details about the import result of every single meshObject.
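
The idea of reporting a result per object instead of failing the whole request can be sketched like this in Python. Please note that this is not the actual meshStack implementation or response format: ImportResult, import_objects and apply_one are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ImportResult:
    object_id: str
    success: bool
    error: Optional[str] = None


def import_objects(objects: dict, apply_one: Callable[[str, dict], None]) -> list:
    """Apply every object and collect one result per object.

    A failure of a single object is recorded and does not abort the loop.
    """
    results = []
    for object_id, spec in objects.items():
        try:
            apply_one(object_id, spec)
            results.append(ImportResult(object_id, True))
        except Exception as exc:  # deliberately broad: report, then continue
            results.append(ImportResult(object_id, False, str(exc)))
    return results


def demo_apply(object_id: str, spec: dict) -> None:
    # Stand-in for the real reconciliation; rejects objects without a name
    if "name" not in spec:
        raise ValueError("name is required")


results = import_objects({"a": {"name": "A"}, "b": {}, "c": {"name": "C"}}, demo_apply)
```

The second object fails, but the first and third are still imported and the caller can see exactly which object needs attention.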

Currently we only support access to our API for a few external systems that integrate with meshStack. Once we provide a CLI and every user can use the API that way, meshObjectCollections will be assigned to projects. That way a clear separation of access to meshObjectCollections can be guaranteed.

Scalability of this approach is definitely a topic we will have to solve in the future. With the current approach all objects have to be applied in a single request. If a huge number of objects has to be processed during one request, timeouts or other performance issues could arise. A possible solution we are thinking about is doing it similarly to Kubernetes. You can define your intended objects in several yaml files inside a folder. Those can be uploaded to meshStack via a CLI or a corresponding REST call. Processing of these objects will be done asynchronously once all files are uploaded.

When to use declarative deletion

Deletion support in a declarative API can be a really nice comfort feature for your clients, as a lot of complexity will be removed for them.

declarative synchronization

Removing objects that are no longer present in the request is handled by the declarative system.

The downside of declarative deletion is that it is implicit and can easily result in removing objects that were not intended to be removed, just because they were not part of the request anymore. As long as you are managing stateless objects it might be fine to take that risk, as you could simply recreate them with the next request. If objects are stateful (e.g. databases or volumes), it might be a bad idea to remove the resource and recreate it again. In that case all data will be gone. But even in the stateful use-cases you can reduce the risk by:

  • making declarative deletion explicit via an additional parameter that needs to be provided, so the client is actually aware that declarative deletion will be applied.
  • excluding certain objects from deletion (e.g. volumes and databases). Those can only be deleted in an imperative way. It could also be an option to let the end-user decide which objects should be excluded from the declarative deletion by flagging them accordingly.
  • implementing a dry-run functionality like terraform plan that shows the client which state will be applied in the end. This is a good option when providing access to the declarative API via a CLI, for example. In case of an automated integration between two systems, it is not helpful as there is no one to check the result of the dry-run. Still, some automation might be possible, but that would again require some complex logic on the client side, which we wanted to avoid in the first place.

In general it makes sense to additionally provide an imperative deletion, as declarative deletion should always be an opt-in that the client explicitly has to choose. The client always needs to be aware of the consequences of declarative deletion and needs to be extra careful to always provide a complete list of resources.
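
The three safeguards from the list above (explicit opt-in, protected objects and a dry-run) can be combined in a few lines. The following Python sketch shows one possible shape under our own assumptions: the parameter names declarative_deletion, protected_kinds and dry_run are made up, and objects are reduced to an id and a kind.

```python
PROTECTED_KINDS = frozenset({"volume", "database"})


def objects_to_delete(existing: dict, desired_ids: set, *,
                      declarative_deletion: bool = False,
                      protected_kinds: frozenset = PROTECTED_KINDS) -> list:
    """existing maps object id -> kind. Returns the ids that would be deleted."""
    if not declarative_deletion:  # opt-in: nothing is deleted by default
        return []
    return sorted(
        object_id
        for object_id, kind in existing.items()
        if object_id not in desired_ids and kind not in protected_kinds
    )


def apply(existing: dict, desired_ids: set, *,
          declarative_deletion: bool = False, dry_run: bool = False) -> list:
    """Delete missing objects unless dry_run is set; always report the plan."""
    doomed = objects_to_delete(
        existing, desired_ids, declarative_deletion=declarative_deletion
    )
    if not dry_run:
        for object_id in doomed:
            del existing[object_id]
    return doomed


state = {"vm-1": "vm", "vm-2": "vm", "db-1": "database"}
plan = apply(state, {"vm-1"}, declarative_deletion=True, dry_run=True)
```

The dry-run reports that only vm-2 would be removed. db-1 is protected and would have to be deleted imperatively, and without declarative_deletion=True nothing is deleted at all.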

Should I provide a declarative API? (You probably should)

Declarative APIs are becoming more and more popular, especially in the context of Infrastructure as Code.

At meshcloud we've implemented a declarative API. In this post I want to provide insights into the process and answer these questions:

  • Does it make sense to provide declarative APIs for all systems?
  • Which use-cases benefit from it and which don't?

But first things first, let's start with a look at what a declarative API actually is all about:

What is a declarative API?

First, let's have a look at the classical way of implementing an API: implementing it imperatively. With imperative APIs you have to provide dedicated commands the system has to execute: Create this VM, update the memory settings of this VM, remove a certain network from this VM, etc.

A declarative API is a desired state system. You provide a certain state you want the system to create. You don't care about all the steps needed to achieve that state. You just tell the system: "Please make sure that the state I provide will be there."

This approach is best known from Infrastructure as Code tools like Kubernetes or Terraform. You tell them that you want a certain set of resources with some given configuration. They take care of creating and updating the resources.

Why provide a declarative API?

With a declarative API you move complexity from the consumer of the system to the system itself. Creating, updating and even deleting objects is no longer a consumer concern.

That means you can offer your consumers a much simpler API by providing a declarative API to your system for some use-cases. This results in fewer errors due to misunderstandings between client and API provider.

It is certainly not the ideal solution for all use-cases, but more about that later.

Let's have a look at an example that shows how a declarative API can simplify the consumer's interaction with your system.

Example: Synchronizing Groups

Imagine you have a user group synchronized between a central Identity Provider (IdP) and your system (target). A group would have these properties:

    id = "123-456"
    displayName = "My Group"
    members = ["uid1", "uid2"]

Your system provides integrations for multiple IdPs and multiple clients of your API exist. These clients should focus on getting information from the IdP - and then on getting it into your target system.

Solving it with an Imperative API

Now imagine that you have an imperative API in the target system: What do you have to do to always keep all groups in sync?

Let's first have a look at the operations that are available:

  • createGroup: Creates a group with the given attributes. If a group already exists, an error is returned.
  • updateGroup: Updates an existing group with the given attributes. If the group does not exist yet, it returns an error.
  • deleteGroup: Deletes a group by its id.
  • getAllGroups: Returns all groups that are available in the target system. This endpoint could provide some filter options, but this is not relevant for this blog post.

What you need to do for a full sync of groups from the IdP to your system:

sync of groups from the IdP to your system

As you can see, creating such a synchronization process is a rather complex task. That hurts especially in the given example, where you want a lightweight solution to implement multiple clients for the different IdPs.

This complexity requires a lot of effort every time you integrate a new IdP at a customer.
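
To illustrate how much logic ends up in every client, here is a minimal Python sketch of such a sync against the four imperative operations listed above. ImperativeTarget is an in-memory stand-in for the target system that we made up for this example. A real client would additionally need error handling, paging and retries.

```python
class ImperativeTarget:
    """In-memory stand-in for the target system's imperative API."""

    def __init__(self):
        self.groups = {}

    def create_group(self, group: dict) -> None:
        if group["id"] in self.groups:
            raise ValueError("group already exists")
        self.groups[group["id"]] = dict(group)

    def update_group(self, group: dict) -> None:
        if group["id"] not in self.groups:
            raise KeyError("group does not exist")
        self.groups[group["id"]] = dict(group)

    def delete_group(self, group_id: str) -> None:
        del self.groups[group_id]

    def get_all_groups(self) -> list:
        return [dict(g) for g in self.groups.values()]


def sync_groups(target: ImperativeTarget, idp_groups: list) -> None:
    """The diff logic every single client has to implement on its own."""
    existing = {g["id"]: g for g in target.get_all_groups()}
    wanted = {g["id"]: g for g in idp_groups}
    for group_id, group in wanted.items():
        if group_id not in existing:
            target.create_group(group)
        elif existing[group_id] != group:
            target.update_group(group)
    for group_id in existing.keys() - wanted.keys():
        target.delete_group(group_id)


target = ImperativeTarget()
sync_groups(target, [{"id": "123-456", "displayName": "My Group", "members": ["uid1", "uid2"]}])
sync_groups(target, [{"id": "123-456", "displayName": "My Group", "members": ["uid1"]}])
```

Even this stripped-down version has to fetch the full state, build a diff and pick the right operation per group.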

Solving it with a declarative API

In cases like the group sync, moving all the complex update logic to the target system and providing a declarative API simplifies the client a lot.

What would a declarative API look like?

  • applyGroups: You only need one operation, to which you provide a list of groups, like this:

        - id: "123-456"
          groupName: "developers"
          members: ["dev1", "dev2"]
        - id: "456-789"
          groupName: "managers"
          members: ["manager1"]
        - id: "789-132"
          groupName: "operators"
          members: ["op1"]

On the next synchronization dev2 became a manager and has been moved to the managers group. In addition, the operators group has been removed, as the company decided to go with a DevOps team. So op1 has been moved to the developers group and the developers group has been renamed to devops. All you have to do is run the exact same process as before, which is:

declarative synchronization

That means the second call will be looking like this:

        - id: "123-456"
          groupName: "devops"
          members: ["dev1", "op1"]
        - id: "456-789"
          groupName: "managers"
          members: ["manager1", "dev2"]

The target system will take care of removing dev2 from the developers group and adding them to the managers group. It will rename the developers group to devops and add member op1 to that group. It will also take care of removing the operators group.
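
The two calls from this example can be replayed against a tiny in-memory implementation. The Python sketch below is only meant to show the principle. The class DeclarativeTarget and the format of the returned operations are assumptions, and the member moved from the operators group is called op1 as in the first call.

```python
class DeclarativeTarget:
    """In-memory stand-in for a target system with a declarative API."""

    def __init__(self):
        self.groups = {}

    def apply_groups(self, desired: list) -> list:
        """Make the stored groups equal to `desired`; return what was done."""
        wanted = {g["id"]: g for g in desired}
        operations = []
        for group_id in sorted(wanted.keys() | self.groups.keys()):
            if group_id not in wanted:
                del self.groups[group_id]
                operations.append(f"delete {group_id}")
            elif group_id not in self.groups:
                self.groups[group_id] = dict(wanted[group_id])
                operations.append(f"create {group_id}")
            elif self.groups[group_id] != wanted[group_id]:
                self.groups[group_id] = dict(wanted[group_id])
                operations.append(f"update {group_id}")
        return operations


target = DeclarativeTarget()
first = target.apply_groups([
    {"id": "123-456", "groupName": "developers", "members": ["dev1", "dev2"]},
    {"id": "456-789", "groupName": "managers", "members": ["manager1"]},
    {"id": "789-132", "groupName": "operators", "members": ["op1"]},
])
second = target.apply_groups([
    {"id": "123-456", "groupName": "devops", "members": ["dev1", "op1"]},
    {"id": "456-789", "groupName": "managers", "members": ["manager1", "dev2"]},
])
```

The first call creates all three groups. The second call updates the first two groups and deletes the operators group, while the client code is exactly the same in both cases.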

Of course, all the logic mentioned in the imperative approach now needs to be implemented in your target system. BUT you will only have to implement it once.

You may even come up with an architecture of your system that simplifies implementation of that declarative approach further.

This example is limited to a holistic synchronization of all groups all the time. If you have multiple sources you get your input from, you need some kind of bucket for the groups coming from one system. You could e.g. simply add another input to the applyGroups function that allows submitting an additional bucket parameter. That allows your system to only take all groups related to the given bucket into consideration for the consolidation of which groups to create, update or delete.

As this is especially relevant for deleting groups, more details about this will be part of my upcoming blog post on "How to implement deletion for a declarative API?".

The as-code advantage

Another great advantage of the declarative approach is that you can store what you applied in a version control system (VCS). This has several advantages:

  • You have a full history of all changes
  • You have a nice overview of the expected state in a system.
  • You can work cooperatively on the desired state with a whole team.

Kubernetes or Terraform are good, real-world examples of storing the desired state in a VCS. The Kubernetes YAML files and the Terraform files are usually stored in a VCS.

If, in contrast, you only apply imperative commands to the target system, you would always have to ask the target system about the current state. You may also override changes or reapply changes someone else had made before.

Declarative vs Imperative API, opponents or a nice team?

You may ask yourself: Should I build a completely declarative system without providing any imperative commands?

In some rare cases that might be the right thing to do. In most cases it makes much more sense to combine an imperative and a declarative API. Actually that is what the big Infrastructure as Code tools do. In the Kubernetes documentation you can find an imperative as well as a declarative way of managing your Kubernetes objects.

Besides use-cases like the group synchronization or Infrastructure as Code that profit a lot from a declarative system, there are also use-cases that benefit from an imperative API.

So you should decide per use-case in your system which API fits better.

Example for less effort with Imperative API

Let's extend our previous example with the user groups and say you need these groups to assign them to projects in your system. These projects cannot simply be created in your system, but a central workflow tool with an approval process must be used to create new projects. After the workflow has completed, this tool will create the corresponding project in your system via your API. Afterwards the project will only be maintained in your system. There will be no updates coming from the workflow tool.

project creation workflow

An imperative approach with an operation createProject is just the right thing for that use-case. Sure, it would also work with a declarative approach. You'd just call the apply operation once to create the project. But as only a create operation will be done once, there is no need for the complex handling of updating or deleting existing projects. So you can save quite a lot of effort by not implementing the declarative handling in your system, if you won't use it.

Example for simpler client implementation with Imperative API

Another example is only updating a certain attribute of e.g. the project. Let's say the project has some tags that can be set on it. Tags are simple key/value pairs. If you want to update them in a declarative API, you have to provide the complete project like this:

{
    id: "123-456",
    displayName: "My Project",
    assignedGroups: [
        ...
    ],
    tags: [
        environment: "dev"
    ]
}

But if the tags are provided by a system that only knows about the project id and the tags, how does it get the other information it needs to update the project? It would first have to call a get endpoint on the target system to fetch all data of the project. Only then could it set the new tags and update the project.

declarative attribute update

For this use-case an imperative API makes the client's life much easier. It could just call an imperative operation like addTag or setTags.

target addTag "environment" to "prod"
target setTags ["environment" to "prod", "confidentiality" to "internal"]
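
The difference for the client can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not a real API: the class TargetSystem, its methods get_project, apply_project and set_tags, and the group name "developers" are hypothetical names chosen for this example.

```python
from copy import deepcopy


class TargetSystem:
    """Hypothetical in-memory target system offering both API styles."""

    def __init__(self):
        self.projects = {}

    def get_project(self, project_id):
        return deepcopy(self.projects[project_id])

    def apply_project(self, project):
        # Declarative: the caller must send the complete desired project.
        self.projects[project["id"]] = deepcopy(project)

    def set_tags(self, project_id, tags):
        # Imperative: only the tags change, everything else stays untouched.
        self.projects[project_id]["tags"] = dict(tags)


target = TargetSystem()
target.apply_project({
    "id": "123-456",
    "displayName": "My Project",
    "assignedGroups": ["developers"],
    "tags": {"environment": "dev"},
})

# Declarative client: read the whole object, modify it, write it back.
project = target.get_project("123-456")
project["tags"] = {"environment": "prod"}
target.apply_project(project)

# Imperative client: knows only the project id and the tags.
target.set_tags("123-456", {"environment": "prod", "confidentiality": "internal"})
```

The declarative client needs an extra round trip and has to handle data it does not own, while the imperative client gets by with one call.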

When should you provide a declarative API?

In the end you have to decide per use-case which kind of API design fits best. The following comparison lists some advantages of the two approaches. They can help you in making that decision.

Imperative API

  • finer control on client side
  • fits perfectly fine when only creating new objects and not keeping these objects in sync after creation
  • easier to use when only updating partial data (especially when attributes of an object are sourced from multiple different systems)

Declarative API

  • way easier client code for data synchronization scenarios as the client only needs to provide the desired state and the rest is done by the target system
  • version control option for desired state
  • single point of implementation at server side
  • hide complexity of creating a certain state in the backend. Clients don't have to care about that complexity

In general, I think the following rule of thumb can also help with that decision:

  • Use a declarative API if you want to keep objects in sync with a central source. This is especially true if you have multiple clients of that API, as the complex handling of updates and deletions only has to be implemented once, centrally in your system.
  • Use an imperative API for one-time operations: you just want to create, update or delete one specific thing, and afterwards you don't care about the object's lifecycle anymore.
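
To illustrate what "implemented once centrally" means, here is a minimal sketch of the server-side logic behind a declarative apply operation. It assumes the state can be modelled as a dictionary keyed by object id; the function names plan_changes and apply and the example groups are hypothetical.

```python
def plan_changes(desired, actual):
    """Compare desired and actual state, both dicts keyed by object id."""
    to_create = sorted(k for k in desired if k not in actual)
    to_delete = sorted(k for k in actual if k not in desired)
    to_update = sorted(k for k in desired if k in actual and desired[k] != actual[k])
    return {"create": to_create, "update": to_update, "delete": to_delete}


def apply(desired, actual):
    """Make the actual state match the desired state and report what changed."""
    plan = plan_changes(desired, actual)
    for key in plan["create"] + plan["update"]:
        actual[key] = desired[key]
    for key in plan["delete"]:
        del actual[key]
    return plan


actual_groups = {"dev": ["alice"], "ops": ["bob"]}
desired_groups = {"dev": ["alice", "carol"], "qa": ["dave"]}
plan = apply(desired_groups, actual_groups)
```

Calling apply a second time with the same desired state yields an empty plan, which is what makes repeated synchronization runs safe.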

But now I'd like to hear from you: Have you implemented a declarative API? What challenges did you come across? Did I miss anything in my blog post? Let me know!

I recommend visiting our blog, where you will find many interesting posts on engineering topics, e.g. our guides to TLS/SSL certificates or the guide to testing IaC.

A Developer's Practical Guide to TLS/SSL Certificates

For many developers certificates are a black box: It is old tech with terrible documentation and the underlying encryption is complex and hard to understand. This practical guide to TLS/SSL certificates will help you navigate these hazardous waters.

In this guide you will learn:

  • Reasons why a certificate may be invalid
  • All about self signed certificates
  • Certificate formats
  • Components of a certificate

Why you should care about TLS/SSL certificates

Certificates play a central role in IT and cloud security. Failure to understand them and handle them correctly can result in serious damage to business.

Certificate vs. public key vs. private key vs. CSR

When using TLS certificates you will be working with different parts that work together in different use cases:

Key pairs consist of private and public keys that belong together and are used for asymmetric encryption:

  • Private key
    • must be kept secret (not part of a certificate)
    • decrypt messages encrypted with the respective public key
    • sign messages
  • Public key
    • shared with others (included with a certificate)
    • encrypt messages for the holder of the respective private key
    • verify messages were signed by the respective private key

Certificate signing requests (CSR) are basically unsigned certificates. They contain all information required for creating a certificate, including the public key (but not the private key). They're presented to a certificate authority (e.g. a well-known certificate issuer), which can then validate that everything is in order (e.g. if someone requests a certificate for a domain, the certificate issuer should ensure that the requestor actually controls that domain) and use its own private key to create a signed certificate from the information contained in the CSR.

Think CSR = certificate - issuer signature.

A certificate contains many pieces of information with the following being of most practical importance:

  • Subject: Who is the owner of the certificate?
  • Subject Public key: Used for communicating with the certificate owner.
  • Issuer: Who signed this certificate?

Certificates are usually not secret since they contain no private information. However, they're sometimes bundled together with their private keys, so care must be taken.

TLS certificates contain different pieces of information. You can look at the contents of a certificate using openssl. When working with certificates it's very likely that you will encounter the following parts:

Subject: information about the certificate subject. Who does this certificate belong to?

The Common Name (or CN) is the most basic piece of identifying information about the subject of a certificate and you may well encounter certificates that provide only CNs and no further subject information. Previously the common name was used to verify that a certificate was used for the correct host, so in most cases, certificates still use a hostname for the common name (as seen above).

CN validation may still work with legacy applications, but current browsers and libraries no longer use the common name for validation. Instead, they use the Subject Alternative Name (or SAN).

SAN has the added benefit that it holds a list of domain names and may also include IP addresses.

Issuer: information about the certificate issuer. Which authority signed this certificate?

$ openssl x509 -in <certificate-file> -noout -text | grep 'Issuer:'
Issuer: C = US, O = DigiCert Inc, CN = DigiCert TLS RSA SHA256 2020 CA1

It also contains the Authority Key Identifier which allows you to identify the key pair used for signing this certificate.

$ openssl x509 -in <certificate-file> -noout -text | grep -A1 'Authority Key Identifier'
X509v3 Authority Key Identifier:

Formats of certificates

Common formats for TLS certificates are

  • DER encoded binary data (.der, .cer, .crt)
  • PEM: Base64 encoded DER (.pem but also often .cer and .crt), files begin and end with -----BEGIN CERTIFICATE-----/-----END CERTIFICATE-----. Used by most *nix applications/libraries.
  • PKCS12: Binary data, may include private key, optionally password protected (.pfx, .p12).
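
The relationship between DER and PEM can be sketched with nothing but Base64. This is a minimal illustration, not a certificate parser: the helper names der_to_pem and pem_to_der are made up for this example, and the input bytes are placeholders rather than a real certificate.

```python
import base64

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def der_to_pem(der_bytes):
    """Wrap DER bytes as PEM: Base64 text in 64-character lines between markers."""
    b64 = base64.b64encode(der_bytes).decode("ascii")
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([PEM_HEADER, *lines, PEM_FOOTER]) + "\n"


def pem_to_der(pem_text):
    """Strip the PEM markers and decode the Base64 body back to DER bytes."""
    body = pem_text.strip()
    if not (body.startswith(PEM_HEADER) and body.endswith(PEM_FOOTER)):
        raise ValueError("not a PEM certificate")
    b64 = body[len(PEM_HEADER):-len(PEM_FOOTER)]
    return base64.b64decode("".join(b64.split()))


# Placeholder bytes, not a real certificate; the conversion does not parse DER.
fake_der = bytes(range(256))
pem = der_to_pem(fake_der)
```

Python's standard library ships the same conversion as ssl.DER_cert_to_PEM_cert and ssl.PEM_cert_to_DER_cert, so in practice you would rarely write this yourself.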

Reasons why a certificate may be invalid

As a developer, you're most likely to get into certificates when something doesn't work. Usually, this is because a certificate can't be validated. Here are some of the most likely reasons why a certificate is invalid.

Expired or not yet valid

Certificates are only valid for a specific time window: not only can they expire, they may also not be valid yet.

# Lookup certificate validity for a local certificate.
$ openssl x509 -in <certificate-file> -noout -text | grep -A2 'Validity'
  Not Before: Nov 24 00:00:00 2020 GMT
  Not After : Dec 25 23:59:59 2021 GMT

Some clients may also reject certificates whose validity period is too long (typically more than 1 or 2 years), e.g. Safari does not trust certificates that are valid for more than 1 year plus a grace period of 30 days.

Let's Encrypt certificates are also only valid for 90 days which is why they provide tools for automated renewals.
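
The validity check itself boils down to comparing three timestamps. Here is a minimal sketch using the timestamp format printed by openssl; the function name is_within_validity is hypothetical, while ssl.cert_time_to_seconds comes from Python's standard library.

```python
import ssl


def is_within_validity(not_before, not_after, at):
    """Check whether time `at` lies inside the certificate's validity window.

    All arguments use the timestamp format printed by openssl,
    e.g. "Nov 24 00:00:00 2020 GMT".
    """
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    return start <= ssl.cert_time_to_seconds(at) <= end


# Values taken from the openssl output above.
NOT_BEFORE = "Nov 24 00:00:00 2020 GMT"
NOT_AFTER = "Dec 25 23:59:59 2021 GMT"
```

A date in June 2021 falls inside this window, while any date before Nov 24 2020 or after Dec 25 2021 does not.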

Name validation fails

TLS certificates are only valid for a specific set of domains/addresses which are specified in a certificate's Subject Alternative Name (SAN) field.

# Looking up SAN of a local certificate
$ openssl x509 -in <certificate-file> -noout -text | grep -A1 'Subject Alternative Name:'
X509v3 Subject Alternative Name:

The field contains a list of DNS names and/or IP addresses that this certificate is valid for. DNS names may also contain a wildcard (*) as their leftmost label. Note that a wildcard only stands for a single label: it covers neither additional levels of subdomains nor the naked domain name itself.
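
The wildcard rule can be sketched as a small matching function. This is a simplified illustration, not the full matching algorithm used by TLS libraries: the function name san_matches is hypothetical, and a wildcard is only honoured as the complete leftmost label.

```python
def san_matches(pattern, hostname):
    """Match a hostname against one DNS entry of a SAN list.

    Simplified: a wildcard is only honoured as the complete leftmost label
    and stands for exactly one non-empty label.
    """
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels
```

With this sketch, the pattern *.example.com matches www.example.com, but neither a.b.example.com nor example.com.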

Since SAN was not always present in certificates you may encounter legacy applications that still rely on validating against the Common Name (CN) of a certificate, which is why this is usually set to the domain name as well.

# Looking at the subject information of a local certificate
$ openssl x509 -in <certificate-file> -noout -text | grep 'Subject:'
Subject: C = US, ST = California, L = Los Angeles, O = Internet Corporation for Assigned Names and Numbers, CN =

When working with recent applications and libraries, setting the common name is not enough: you definitely need to specify SAN.

No trust

Certificates are not trusted automatically but rely on a big bundle of trusted certificate authorities (CA) that are usually distributed as part of an operating system or firmware. Your operating system or device is configured to trust all certificates signed by these CAs. If you encounter a certificate that is signed by a different CA it will be rejected.

Applications may also decide to use their own sets of trusted certificates, e.g. the Java Keystore.

If a certificate is not trusted even though it has been issued by a well-trusted issuer there may be missing intermediate certificates.

Consider the following example: a trusted CA A has issued a certificate to another CA B which allows them to issue certificates of their own. The certificate you're seeing has been issued by B and since B is not included in your trusted certificates the certificate is rejected. To avoid this issue everyone using certificates issued by B should also include the certificate which verifies that B may issue certificates (issued by A). This is an intermediate certificate that allows clients to verify that there is an intact chain of trust from A to B to the certificate they're actually concerned with.
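
The role of the intermediate certificate can be sketched with a toy model. It assumes certificates are reduced to (subject, issuer) pairs; real validation also verifies every signature, validity period and usage restriction. The function name build_chain and the host name are hypothetical.

```python
def build_chain(leaf, provided, trusted_roots):
    """Follow issuer names from the leaf up to a trusted root.

    Certificates are modelled as (subject, issuer) tuples.
    Returns the list of subjects on the path, or None if the chain breaks.
    """
    by_subject = {subject: (subject, issuer) for subject, issuer in provided}
    chain = [leaf[0]]
    issuer = leaf[1]
    while issuer not in trusted_roots:
        if issuer not in by_subject or issuer in chain:
            return None  # missing intermediate certificate (or a loop)
        chain.append(issuer)
        issuer = by_subject[issuer][1]
    chain.append(issuer)
    return chain


leaf = ("www.example.com", "B")   # issued by the intermediate CA B
intermediate = ("B", "A")         # B's certificate, issued by the trusted CA A
```

Without the intermediate certificate the walk stops at B and fails. Once the server also sends B's certificate, the chain reaches the trusted CA A.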

It's of course also possible to encounter certificates from CAs that are simply not trusted e.g. because they're only used for private or internal purposes. If you are certain that such a CA should be trusted you can of course add them to your locally trusted CAs.

Forbidden usage

Certificates can be restricted in their usage. For example, when you receive a signed certificate from a well-known issuer, you're not allowed to sign other certificates with it, i.e. you may not act as a CA yourself, because then you could sign any certificate you wanted without oversight.

# Checking allowed usage
$ openssl x509 -in <certificate-file> -noout -text | grep -A1 'Usage'
X509v3 Key Usage: critical
  Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
  TLS Web Server Authentication, TLS Web Client Authentication

Above you see the typical set of allowed usages for a TLS certificate used for HTTPS. Note that certificate signing is not included.

Self-signed certificates

Self-signed certificates are frequently used for testing purposes or in ad-hoc situations. They're signed by the same key that is used for the certificate itself. This implies that the certificate metadata has not been verified by an external entity which is why they're considered not trustworthy.

To learn more about the meshcloud platform, please get in touch with our sales team or book a demo with one of our product experts. We're looking forward to getting in touch with you.

Building a generic Cloud Service Broker using the OSB API

This post gives an overview of OSB API service brokers and introduces an open-source generic OSB using git.

If you work as an Enterprise Architect, in a Cloud Foundation Team, in DevOps - or you're just interested in implementing the OSB API - this post is for you.

In this post we will answer the questions:

  • How can cloud services be distributed enterprise wide?
  • What are cloud service brokers?
  • How does the OSB API work?
  • What are the advantages of the OSB API?

Also, we'll go into detail on our generic Unipipe Service Broker that uses git to version cloud service instances, their provisioning status, and the service marketplace catalog.

The Service Marketplace as Central Distributor for Cloud Services

Platform services play an increasingly important role in cloud infrastructures. They enable application operators to quickly stick together the dependencies they need to run their applications.

Many IT organizations choose to offer a cloud service marketplace as part of their service portfolio. The marketplace acts as an addition to the cloud infrastructure services provided by large cloud providers like AWS, GCP, or Azure.

Service owners can build a light-weight service broker to offer their service on the marketplace. The broker is independent of any implementations of the service itself.

What exactly is a Service Broker and how does OSB API work?

The service broker handles the order of a cloud service and requests a service instance on the actual cloud platform. The user can choose from a catalog of services - e.g. an Azure vNet instance - enter necessary parameters and the broker takes care of getting this specific instance running for the user. In this example, the cloud service user would specify the vNet size (how many addresses do you need?), if it needs to be connected to an on-prem-network and so on.

A popular choice for modeling service broker to marketplace communication is the Open Service Broker API. The OSB API standardizes the way cloud services are called on cloud-native platforms across service providers. When a user browses the service catalog on the marketplace, finds a cloud service useful for their project, and orders it, an OSB API request is invoked to provision the service.
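
To make the provisioning request more concrete, here is a minimal sketch that assembles such a call without sending it. It follows the OSB API specification in using PUT /v2/service_instances/:instance_id, the X-Broker-API-Version header and a body with service_id and plan_id. The function name build_provision_request and all ids, parameters and credentials are made-up placeholders.

```python
import base64
import json


def build_provision_request(instance_id, service_id, plan_id, parameters,
                            username, password, api_version="2.14"):
    """Assemble the parts of an OSB API provisioning call without sending it."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode("ascii")
    return {
        "method": "PUT",
        "path": f"/v2/service_instances/{instance_id}?accepts_incomplete=true",
        "headers": {
            "X-Broker-API-Version": api_version,
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "service_id": service_id,
            "plan_id": plan_id,
            "parameters": parameters,
        }),
    }


# All ids, parameters and credentials below are made-up placeholders.
request = build_provision_request(
    instance_id="instance-1",
    service_id="azure-vnet",
    plan_id="small",
    parameters={"addressCount": 256, "onPremConnection": False},
    username="broker-user",
    password="broker-pass",
)
```

The accepts_incomplete=true query parameter signals that the marketplace can handle asynchronous provisioning, which suits brokers that hand the actual work to a pipeline.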

Building a Cloud Service Broker using the OSB API and GIT

At meshcloud we offer a Service Marketplace with our cloud governance solution that communicates via the OSB API. A service broker has to be built to offer services on this marketplace, so we started an open source project to provide developers with a generic broker called the Unipipe Service Broker.

The idea is an implementation of GitOps.
To understand how UniPipe can help to implement GitOps, consider the following definition:

The core idea of GitOps is (1) having a Git repository that always contains declarative descriptions of the infrastructure currently desired in the production environment and (2) an automated process to make the production environment match the described state in the repository.

The broker is implementing (1) by creating and maintaining an updated set of instance.yml files for services ordered from a Platform that speaks OSB API.
The automated process (2) can then be built with the tools your GitOps Team chooses.

The actual provisioning is done via a CI/CD pipeline triggered by changes on the git repository, using Infrastructure-as-Code (IaC) tools that are made for service provisioning and deployment, like Terraform.

Using git might be a limiting choice for service owners who expect frequent concurrent orders. But from our experience, the majority of service brokers are called more like once an hour than once a second - even at large companies.

Configuration of the Unipipe Service Broker

You can look up everything you need to get started on our Unipipe Service Broker page.

The custom configuration of our generic broker can be done via environment variables. The following properties can be configured:

  • GIT_REMOTE: The remote Git repository to push the repo to
  • GIT_LOCAL-PATH: The path where the local Git Repo shall be created/used. Defaults to tmp/git
  • GIT_SSH-KEY: If you want to use SSH, this is the SSH key to be used for accessing the remote repo. Linebreaks must be replaced with spaces
  • GIT_USERNAME: If you use HTTPS to access the git repo, define the HTTPS username here
  • GIT_PASSWORD: If you use HTTPS to access the git repo, define the HTTPS password here
  • APP_BASIC-AUTH-USERNAME: The service broker API itself is secured via HTTP Basic Auth. Define the username for this here.
  • APP_BASIC-AUTH-PASSWORD: Define the basic auth password for requests against the API

The expected format for the GIT_SSH-KEY variable looks like this:

Hgiud8z89ijiojdobdikdosaa+hnjk789hdsanlklmladlsagasHOHAo7869+bcG x9tD2aI3...ysKQfmAnDBdG4=
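
Bringing a multi-line key file into this single-line form can be sketched as follows. The helper name flatten_ssh_key is hypothetical and the key content is a placeholder, not real key material.

```python
def flatten_ssh_key(key_text):
    """Replace every line break in a multi-line key with a single space."""
    return " ".join(key_text.strip().splitlines())


# Placeholder content standing in for the lines of a real key file.
multi_line_key = "key-line-1\nkey-line-2\nkey-line-3\n"
flat = flatten_ssh_key(multi_line_key)
```

The resulting string can then be assigned to the GIT_SSH-KEY variable.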

Deployment using Docker

We publish generic-osb-api container images to GitHub Container Registry. These images are built on GitHub Actions and are publicly available.

$ docker pull

Deployment to Cloud Foundry

In order to deploy the Unipipe Service Broker to Cloud Foundry, you just have to use a configured manifest file like this:

applications:
- name: generic-osb-api
  memory: 1024M
  path: build/libs/generic-osb-api-0.9.0.jar
  env:
    GIT_REMOTE: <https or ssh url for remote git repo>
    GIT_USERNAME: <if you use https, enter username here>
    GIT_PASSWORD: <if you use https, enter password here>
    APP_BASIC-AUTH-USERNAME: <the username for securing the OSB API itself>
    APP_BASIC-AUTH-PASSWORD: <the password for securing the OSB API itself>

./gradlew build # build the jar of the Generic OSB API
cf push -f cf-manifest.yml # deploy it to CF

Communication with the CI/CD pipeline

As the OSB API is completely provided by the Unipipe Service Broker, what you as a developer of a service broker have to focus on is building your CI/CD pipeline. An example pipeline can be found here.

Multi-Cloud Monitoring: A Cloud Security Essential

This is an introduction to cloud monitoring: If you work as a cloud operator or developer or you want to learn about cloud monitoring - this blog post is for you!

In this post you will learn:

  • What cloud monitoring is
  • How it helps you secure business success
  • How monitoring and alerting connect
  • About different types of monitoring
  • How Prometheus and cAdvisor work

Let's get started with the basics!

Cloud Monitoring: Definition and Challenges

Monitoring helps you understand the behavior of your cloud environments and applications.
Technically speaking, in IT, monitoring refers to observing and checking the state of hardware or software systems, essentially to ensure that the system is functioning as intended at a specific level of performance.

Monitoring in cloud environments can be a challenging task. Since you do not control all layers of the infrastructure, monitoring is limited to the upper layers, depending on the cloud service model. Besides, cloud consumers frequently use containerized applications. Containers are intended to be short-lived, and even when they do run for a long time, we don't rely on them, e.g. for storing data. This dynamic nature makes monitoring them challenging. Tools such as Prometheus with cAdvisor take care of this challenge. More on that in the two bonus sections at the end of this blog post.

Five reasons why cloud monitoring helps business success

Here are five reasons why good monitoring helps you secure business success:

  1. Increase system availability: Don't let users take the place of proper monitoring and alerting. When an issue occurs on a system that is not being monitored, it will most certainly be reported by the users of that system. Detect problems early to mitigate them, before a user is disrupted by them.
  2. Boost performance: Monitoring a system leads to a more detailed understanding of it. Flaws become visible, and developers gain the detailed insight they need to fix problems and improve performance.
  3. Make better decisions: Detailed insight into the current state of a system allows more accurate decision-making based on actual data analysis.
  4. Predict the future: Predicting what might happen in the future by analyzing historical data is very powerful. An example is so-called pre-emptive maintenance; performing maintenance on parts of the system that have a high probability of failing soon, given the historical data provided.
  5. Automate, automate, automate: Monitoring greatly reduces manual work. There is no need to manually check system components when a monitoring system does the checks instead.

Monitoring and Alerting

Monitoring is usually linked to alerting. While monitoring introduces automation by pulling data from running processes, alerting adds even more automation by alerting developers when a problem occurs.

For example: Alerting if a critical process stops running.

Another important reason to monitor is conforming to Service Level Agreements (SLA). Violating the SLA could lead to damage to the business and monitoring helps to keep track of the agreements set in the SLA.

The Different Types of Monitoring

To classify types of monitoring we can ask two questions:

What is being monitored?


How is it being monitored?

To the first question there are many answers:

  • Uptime monitoring: As its name suggests, this is important to monitor service uptime.
  • Infrastructure monitoring: In the cloud world, infrastructure varies from traditional infrastructure in that resources are software-based; i.e. virtual machines and containers. And it is important to monitor these resources since they are the base of running processes and services.
  • Security monitoring: Security monitoring is concerned with SSL certificate expiry, intrusion detection, or penetration testing.
  • Disaster recovery monitoring: Also, taking backups for stored data is always an important and necessary practice. Monitoring the backup process is important to ensure it was done properly at its intended timeframe.

Now to the second question: How is it being monitored?

This lets us differentiate between Whitebox and Blackbox monitoring:
Illustration of whitebox and blackbox monitoring. Credits to for the illustration idea.

Whitebox monitoring: This type refers to monitoring the internals of a system. A monitored application exposes information about itself, which makes its internal state visible to the outside world. The exposed information can take the form of metrics, logs, or traces.
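
As an illustration of exposed metrics, here is a minimal sketch that renders a metric in the Prometheus text exposition format. The function name render_metric, the metric name and the sample values are made up for this example, and label values are not escaped, which a real client library would take care of.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        suffix = "{" + label_str + "}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


output = render_metric(
    "http_requests_total",
    "Total number of handled HTTP requests.",
    "counter",
    [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "500"}, 3)],
)
```

An application would serve this text on an HTTP endpoint, which a metrics-based monitoring system can then collect.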

Blackbox monitoring: This type refers to monitoring the behavior of an object or a system from the outside, usually by performing a probe (e.g. sending an HTTP request or a ping) and checking the result, such as the latency of a request to a server. This type does not check any internals of the application.
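
A blackbox probe can be sketched with Python's standard library alone. This is a minimal illustration, not a replacement for a monitoring tool: the function name probe and the shape of the result are assumptions, and the opener parameter only exists so the probe can be tried without network access.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Send one HTTP GET and report availability, status code and latency."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, ...
    latency = time.monotonic() - start
    return {"up": status is not None and status < 400,
            "status": status,
            "latency_s": latency}
```

The probe only looks at what a user would see from the outside: whether the service answers, with which status, and how fast.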

The concepts of white box and black box are also used in software testing, with a meaning similar to the one in monitoring: there, too, they distinguish between testing the internals and the externals of a software system. The difference is that software testing usually occurs during development, while monitoring is applied when the software is already running.

4 Tips for monitoring cloud security

Correct monitoring will tell you if your cloud infrastructure functions as intended while minimizing the risk of data breaches.

To do that there are a few guidelines to follow:

  • Your monitoring tools need to be scalable to your growing cloud infrastructure and data volumes
  • Aim for constant and instant monitoring of new or modified components
  • Don't rely on what you get from your cloud service provider alone - you need transparency in every layer of your cloud infrastructure
  • Make sure you get enough context with your monitoring alerts to help you understand what is going on

You can and should monitor on different layers (e.g. network, application performance) and there are different tools for doing this. SIEM (Security Information and Event Management) tools collect data from various sources. They process this data to identify and report on security-related incidents and send out alerts whenever a potential risk has been identified.

Bonus 1: Prometheus Architecture

As promised a short excursion to Prometheus:

Prometheus is a metric-based, open-source monitoring tool written in Go. After Kubernetes, it was the second project to graduate from the CNCF, and it will remain fully open-source. Prometheus has its own query language called PromQL, which is powerful for performing operations in the metric space. Prometheus also uses its own time-series database (TSDB) for storage.

Prometheus architecture illustration

Prometheus finds its targets through service discovery or through statically defined targets. The targets it scrapes are either applications that expose Prometheus metrics directly through the Prometheus client libraries, or exporters that translate data from third-party applications into metrics that Prometheus can scrape.
Prometheus stores the scraped metrics in its own time-series storage and also uses this data to evaluate alert rules. Once a rule's condition is met, an alert is sent to Alertmanager, which in turn sends a notification to a configured destination (email, PagerDuty, etc.).

Prometheus time-series data can also be visualized by third-party visualization tools such as Grafana. These tools use the Prometheus query language to pull time-series data from Prometheus.

Bonus 2: Container Monitoring using cAdvisor and Prometheus

cAdvisor (Container Advisor) is a tool to tackle the challenge of monitoring containers. Its core functionality is making the resource usage and performance characteristics of containers transparent to their users. cAdvisor exposes Prometheus metrics out of the box. It is a running daemon that collects, aggregates, processes, and exports information about running containers. cAdvisor supports Docker and pretty much every other container type out there.

To get started you'll need to configure Prometheus to scrape metrics from cAdvisor:

scrape_configs:
- job_name: cadvisor
  scrape_interval: 5s
  static_configs:
  - targets:
    - cadvisor:8080

Create your containers - Docker for example - that run Prometheus, cAdvisor, and an application to see metrics produced by your containers, collected by cAdvisor, and scraped by Prometheus.

Authors: Mohammad Alhussan and Wulf Schiemann

Testing Infrastructure as Code (IaC): The Definitive Guide 2020

In this blog post we're going to explain if and how Infrastructure as Code should be tested. We'll illustrate 5 examples with Terraform - the tool we use here at meshcloud - and tell you what to look for in IaC test tooling.

Here are the topics we will touch on. Let's dive right in:

1. What is Infrastructure as Code?

2. Do I Even Need to Test IaC?

3. The IaC Testing Usefulness Formula

4. The Developer Effect

5. The 3 Essential Types of Testing Infrastructure as Code

6. 4 Practical Examples for IaC Testing

7. The 3 Key Factors for Choosing Your Test Tooling

What Exactly is Infrastructure as Code?

To make sure we are starting from the same point: What do we mean by Infrastructure as Code (IaC)? IaC describes the specification of required infrastructure (VMs, storage, networking) in text form. We define a target state, which can be easily adapted, duplicated, deleted and versioned. IaC relies on modern cloud technologies and enables a high degree of automation. With these new trends, infrastructure development has become much more similar to application development. This raises an interesting question: should you test infrastructure code, and if so, how?

Do I Need to Test IaC at All?

The question is more relevant than you might think. Modern IaC tools like Terraform or Pulumi already have a lot of checks built in and respond with detailed error messages. To better assess the usefulness of IaC testing for you, consider these factors:

  1. Number of components (e.g. VMs, managed services, load balancers, etc.)
  2. Number of environments (e.g. dev, stg, prod)
  3. Number of rollouts which affect infrastructure (daily releases vs. monthly)

The IaC Testing Usefulness Formula

We believe all three factors - components, environments and number of rollouts - are equally important, which results in our IaC Testing Usefulness Formula:

Usefulness of IaC Testing = Components x Environments x Rollouts

The Developer Effect

In practice, we observed another effect that significantly increases the number of rollouts. Lovingly dubbed the "devEffect" at meshcloud, it occurs during the initial set-up of IaC or the modification of existing components. Because developing new IaC is often done in a trial-and-error manner, developers will roll out their changes around 100 times more often. For this reason alone, it can be worth it to write IaC tests.
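
As a quick worked example, the formula and the devEffect can be written down as follows. The function name iac_testing_usefulness and the project numbers are made up for illustration; the factor of 100 is the rough estimate mentioned above.

```python
def iac_testing_usefulness(components, environments, rollouts, dev_effect=1):
    """Usefulness of IaC Testing = Components x Environments x Rollouts.

    `dev_effect` multiplies the rollouts while IaC is being developed.
    """
    return components * environments * rollouts * dev_effect


# Hypothetical project: 10 components, 3 environments, 4 rollouts.
steady_state = iac_testing_usefulness(10, 3, 4)
during_development = iac_testing_usefulness(10, 3, 4, dev_effect=100)
```

For this hypothetical project the score rises from 120 to 12000 while the infrastructure code is under active development.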

The 3 Essential Types of Testing Infrastructure as Code

  1. Static and local checks
  2. Apply and destroy tests
  3. Imitation tests

1. Static and Local Checks

Strictly speaking these are not tests, but simple checks. However we still want to include them in this list as they are an easy and fast way to ensure the correct set-up of your infrastructure.

Idea: Analyze configurations and dependencies before execution.

Goal: Fast detection of static errors. Errors that occur dynamically during execution are not covered.

Examples: Completeness check, field validation, reference analysis.

But everything is better with code examples, so let's have a look. Consider the following terraform file to create a small VM.

resource "google_compute_instance" "test" {
  name = "hello-terratest"

  boot_disk {
    initialize_params {
      image = "ubuntu-1804-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {}
  }
}

The alert reader will have already noticed the absence of the machine_type field. You didn't? That is exactly our point: You don't have to. Terraform automatically informs us about the missing field.

Example of a Terraform Error Message: Missing required argument. The argument "machine_type" is required, but no definition was found.
Terraform nicely outputs an error message.

This can be taken even a step further in the form of autocomplete. Using the terraform language server and VSCode for example, it automatically prompts me to enter the missing field and its type.

Screenshot showing the terraform autocomplete feature in VSCode.
Using the Terraform language server helps avoid errors.

Additionally, we can also check for static dependencies. Consider the following Terraform code, which we assume is the only code in the module. It references a Google compute target pool that does not exist within the context, and Terraform reports the dangling reference before anything is rolled out. This is a total lifesaver in projects with hundreds of dependencies.

resource "google_compute_forwarding_rule" "test" {
  name   = "hello-terratest"
  target = google_compute_target_pool.test.self_link
}

Screenshot depicting resulting terraform error: Reference to undeclared resource.
Terraform identifies the undeclared resource.
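
The idea behind such a reference analysis can be sketched with a toy model. This is not how Terraform implements it: the module is assumed to be a plain dictionary that maps each declared resource to the resources it references, and the function name undeclared_references is hypothetical.

```python
def undeclared_references(resources):
    """Return all referenced resources that are not declared in the module.

    `resources` maps (type, name) of each declared resource to the list of
    (type, name) pairs it references.
    """
    declared = set(resources)
    missing = set()
    for references in resources.values():
        missing.update(ref for ref in references if ref not in declared)
    return sorted(missing)


# Toy model of the module above: one resource with one dangling reference.
module = {
    ("google_compute_forwarding_rule", "test"): [("google_compute_target_pool", "test")],
}
missing = undeclared_references(module)
```

Because the check only needs the declarations, it can run before any infrastructure is touched.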

While static checks are already a big step up from traditional infrastructure deployments, they do not account for dynamic fields and dependencies. This is where our next category, Apply and Destroy Tests, comes into play.

2. Apply and Destroy Tests

With Apply and Destroy Tests we can go one step further and also identify dynamic errors.

Idea: Roll out the infrastructure for a short time, test it and destroy it again immediately afterwards.

Goal: Check dynamic fields and dependencies.

Examples: IP addresses, IDs, randomly generated passwords.

For this type of test we are currently working with Terratest. Terratest is a collection of Go libraries that simplify the interaction with IaC providers and well-known cloud providers. For the following example we are using its terraform module. We can see that Terratest integrates easily with Terraform and is able to send commands and extract output from them.

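package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestApply(t *testing.T) {

    // Point Terratest at the Terraform code under test.
    tfOptions := &terraform.Options{
        TerraformDir: "./terraform/apply_test",
    }

    // Tear the infrastructure down again when the test ends, even if it fails.
    defer terraform.Destroy(t, tfOptions)

    // Run "terraform init" and "terraform apply".
    terraform.InitAndApply(t, tfOptions)

    // further validations go here
}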

Up- and downsides to consider:

Why we're sponsoring the Dhall Language Project

We're very happy to announce that meshcloud is the first corporate sponsor of the Dhall Language Project via Open Collective. In this post I want to explain how we came to use Dhall at meshcloud, what challenges it solves for us and why we hope it will play a role in enabling software projects to adapt more easily to multi-cloud environments.

Enabling DevOps at scale

At the beginning of this year, we realized we had a challenge scaling configuration and operations of our software for customers. meshcloud helps enterprises become cloud-native organizations by enabling "DevOps at scale". Our tool helps hundreds or thousands of DevOps teams in an enterprise to provision and manage cloud environments like AWS Accounts or Azure Subscriptions for their projects while ensuring they are secured and monitored to the organization's standards.

Enabling DevOps teams with the shortest "time to cloud" possible involves the whole organization. Our product serves DevOps teams, IT Governance, Controlling and IT Management in large enterprises. That means meshcloud is an integration solution for a lot of things, so we need to be highly configurable.

Because we also manage private clouds (OpenStack, Cloud Foundry, OpenShift etc.) we often run on-premises and operate our software as a managed service. This presents unique challenges for our SRE team. Not only do we need to maintain and evolve configuration for our growing number of customers, but we also need to support deploying our own software on different infrastructures like OpenStack, AWS or Azure[1].

At the end of the day, this boils down to having good and scalable configuration management. After going through various stages of slinging around YAML with ever more advanced tricks, we realized we needed a more fundamental solution to really crack this challenge.

Configuration management at scale - powered by Dhall

The Dhall configuration language solves exactly this problem. It's a programmable configuration language that was built to express configuration, and just that. Dhall is decidedly not Turing-complete. Its functional nature makes configuration easy to compose from a set of well-defined operations and ensures that configuration stays consistent.

Using Dhall allows us to compile and type check[2] all our configuration for all our customers before rolling things out. We use Dhall to compile everything we need to configure and deploy our software for a customer: Terraform, Ansible, Kubernetes templates, Spring Boot Config. We even use Dhall to automatically generate Concourse CI pipelines for continuous delivery of our product to customers.

Since adopting Dhall earlier this year, we measurably reduced our deployment defect rate. We feel more confident about configuration changes and can safely express configuration that affects multiple services in our software.

Empowering a Multi-Cloud Ecosystem

We believe that open-source software and open-source cloud platforms are crucial for enabling organizations to avoid vendor lock-in. Now that mature tools like Kubernetes exist and can do the heavy lifting, enabling portability between clouds has become a configuration management challenge.

What we found especially interesting about Dhall is that it's not just an "incremental" innovation on top of existing configuration languages like template generators. Instead, it looks at the problem from a new angle and tries to solve it at a more fundamental level. This is something we can relate to very well, as we're trying to solve multi-cloud management using an organization as code (like infrastructure as code) approach.

That's why we're happy to see Dhall innovating in this space and reached out to the Dhall community to explore ways we can support the project. We hope that providing a steady financial contribution will allow the community to further evolve the language, tooling and its ecosystem.


  • [1]: In this way meshcloud is not only a multi-cloud management software but is also a multi-cloud enabled software itself.

  • [2]: Dhall purists will want to point out that expressions are not compiled; instead, they're normalized.