Migrating a Galera Cluster with FlywayDB

By Thomas Felix, 18. May 2018

Many developers rely on and love the easy database migrations FlywayDB provides. Some of you might also use a Galera Cluster on top of MySQL or MariaDB to provide better redundancy and high availability for your database backend. When using Cloud Foundry as the application runtime, this combination can cause undesired side effects:

When multiple Cloud Foundry app instances are deployed in parallel, we have observed sporadic deployment failures that looked like the DB migration was the culprit. Searching the logs showed several race conditions occurring while updating the schema if two or more app instances start simultaneously. This was unfortunate, since Flyway should support migration during the start of multiple apps, as stated in the Flyway FAQ:

Can multiple nodes migrate in parallel?

Yes! Flyway uses the locking technology of your database to coordinate multiple nodes. This ensures that even if multiple instances of your application attempt to migrate the database at the same time, it still works. Cluster configurations are fully supported.

(https://flywaydb.org/documentation/faq)
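To make "locking technology" concrete: on MySQL/MariaDB, this kind of cross-node coordination can be built on session-level advisory locks. The sketch below is purely illustrative and not taken from Flyway's internals; the JDBC URL, credentials, and lock name are our own assumptions.

import java.sql.DriverManager

// Illustrative only: a session-level advisory lock of the kind used to
// coordinate concurrent migrations. GET_LOCK()/RELEASE_LOCK() are standard
// MySQL/MariaDB functions; connection details and lock name are made up.
fun main() {
  DriverManager.getConnection("jdbc:mariadb://localhost:3306/app", "user", "pass").use { conn ->
    conn.createStatement().use { stmt ->
      // GET_LOCK returns 1 if the lock was acquired, 0 on timeout, NULL on error.
      val rs = stmt.executeQuery("SELECT GET_LOCK('schema_migration', 10)")
      rs.next()
      check(rs.getInt(1) == 1) { "Could not acquire migration lock" }
      try {
        // ... apply schema changes while holding the lock ...
      } finally {
        stmt.execute("SELECT RELEASE_LOCK('schema_migration')")
      }
    }
  }
}

Exactly these lock functions are what Galera rules out, as the next quote shows.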

It turns out that this locking mechanism is not fully supported by Galera:

Table Locking

Galera Cluster does not support table locking, as they conflict with multi-master replication. As such, the LOCK TABLES and UNLOCK TABLES queries are not supported. This also applies to lock functions, such as GET_LOCK() and RELEASE_LOCK()… for the same reason.

(https://mariadb.com/kb/en/mariadb-galera-cluster-known-limitations/#limitations-from-codershipcom)

This limitation leads to the problems we observed during migration. This blog post shows an easy fix.

Since a fix on the Galera side seemed unlikely, we needed a reliable way to safely perform database migrations. A single-node deployment worked flawlessly, so we came up with the following solution: Spring allows us to provide a custom FlywayMigrationStrategy. Combined with the CF_INSTANCE_INDEX environment variable set by Cloud Foundry, we can detect which app instance we are currently running in and perform the migration only if we are the app instance with index zero (or if no index is set at all, to allow migration on local developer machines).

This led us to the following code snippet fixing the problem:


import java.util.concurrent.TimeoutException

import mu.KotlinLogging
import org.flywaydb.core.Flyway
import org.springframework.boot.autoconfigure.flyway.FlywayMigrationStrategy

// Assumes kotlin-logging, which matches the log.debug { } syntax used below.
private val log = KotlinLogging.logger {}

class FlywayCustomMigrationConfig : FlywayMigrationStrategy {
  val maxRetries = 60
  val waitBeforeRetryMs = 1000L

  override fun migrate(flyway: Flyway) {
    // CF_INSTANCE_INDEX is set by Cloud Foundry; it is absent on local
    // developer machines, so we treat null like index 0 and migrate.
    val applicationIndex = System.getenv("CF_INSTANCE_INDEX")?.toIntOrNull()
    val shouldMigrate = when (applicationIndex) {
      0, null -> true
      else -> false
    }

    log.debug { "CF_INSTANCE_INDEX found: $applicationIndex" }

    if (shouldMigrate) {
      log.info { "DB migration preconditions match. Performing migration." }
      flyway.migrate()
    } else {
      log.info { "Application index is $applicationIndex. Waiting for primary app instance with index 0 to perform migration." }
      waitForMigrationsOnPrimaryAppInstance(flyway)
    }
  }

  private fun waitForMigrationsOnPrimaryAppInstance(flyway: Flyway) {
    for (i in 1..maxRetries) {
      // Poll the schema history until no migrations are pending.
      val pending = flyway.info().pending()
      if (pending.isEmpty()) {
        log.info { "Migrations completed on primary app instance, will start." }
        return
      }

      log.info { "Waiting for ${pending.size} migrations to complete on primary app instance (retried $i times)." }
      Thread.sleep(waitBeforeRetryMs)
    }
    throw TimeoutException("Exceeded $maxRetries retries waiting for migrations to complete on primary app instance.")
  }
}
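
For Spring Boot's Flyway auto-configuration to pick up this strategy, it has to be registered as a bean. A minimal sketch, assuming a standard Spring Boot setup; the configuration class name is our own:

import org.springframework.boot.autoconfigure.flyway.FlywayMigrationStrategy
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

// Hypothetical configuration class; annotating FlywayCustomMigrationConfig
// itself with @Component would work just as well.
@Configuration
class FlywayMigrationConfiguration {

  @Bean
  fun flywayMigrationStrategy(): FlywayMigrationStrategy = FlywayCustomMigrationConfig()
}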