Improving Uptime With Online Data Migrations

When it comes to problems that keep growing in number and intensity, Paul Graham is famously on record suggesting that startups “Do Things That Don’t Scale.” While this is great for saving time up front and ensuring you’re working on the most pressing problems at hand, it’s an unavoidable fact that, given sufficient growth, the solutions you put in place today will eventually break down and require redesigning. 

While Paul Graham may have been talking about business processes, the principle holds true for software development and its underlying processes as well. 

In this post, I’d like to discuss one technical process we relied on for a long time which gradually became more and more untenable: running database migrations behind a maintenance window. 

Anyone who has worked on a web app of any scale knows that the database layer is a common source of performance problems. In my time at HelloSign, I’ve been a part of several issues, including IO bottlenecks (AWS provisioned IOPS helped a lot), grinding queries with multiple joins (EXPLAIN is your friend), and internal metric reports that could take so long the Apache process timed out (simple enough to fix: design a tailored, denormalized table populated nightly by a worker process). But beyond these fairly isolated problems, there was always an ongoing discomfort around running data migrations, because we took our app offline while they ran. 

For context, any time we want to change the underlying schema of our relational database by adding or dropping columns, changing data types, adding indexes, etc., it necessitates running a data migration. The term may evoke images of 0’s and 1’s marching south for the winter, but it actually refers to the low-level operations the database must perform to apply these changes to a given table. These can be long-running events and, as you might imagine, they’re easier to perform when you don’t have to worry about the data changing while the migration is running. Additionally, we had to consider the codebase and our ORM layer, which need to stay in sync with the schema at all times. 

To address both these requirements we adopted a fairly simple and common approach: run migrations behind a maintenance window so we could be sure they executed in isolation. We would still do dry runs and collect timings, of course, but come release day, waiting for the database tables to be reorganized was a price we continued to pay. 

Although we tried to make these changes conservatively and only when necessary, the time required for the maintenance windows continued to increase. When some of our tables topped tens of millions of rows and migrations were requiring an hour or more of downtime, it was clear that our simple approach just wasn’t working anymore.

Losing That Maintenance Page Ain’t So Easy

Before we could lose the maintenance page and keep the app up during migrations, we needed to address three primary roadblocks to running migrations online:

  1. When running certain DDL statements in MySQL 5.5 (our version at the time) the table is locked until the operation is completed. On large tables this can take a significant amount of time and cause the app to become unavailable.
  2. Related to #1, even if we could prevent the table from being locked, we’d need to know that data written to the table during the operation would carry over into the new schema. Online migration tools try to solve this either by modifying the table in place or by copying the table and applying triggers that track changes and re-apply them to the new table once it finishes migrating.
  3. Like many web applications, we use an ORM which insists that everything it thinks it knows about the database is true (i.e. the DB schema must be at least a superset of what the ORM expects). Furthermore, any code in our web actions or templates needed to be compatible with the underlying data structure, or we risked corruption and errors.
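To make problem #1 concrete, here’s a sketch (the table and column names are illustrative, not from our actual schema). In MySQL 5.5, a statement like this rebuilds the table and blocks writes for the duration:

```sql
-- MySQL 5.5: this copies and locks the (hypothetical) signatures table.
-- On a table with tens of millions of rows, writes can be blocked
-- for the better part of an hour.
ALTER TABLE signatures ADD COLUMN signed_at DATETIME NULL;
```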

We knew there wasn’t a single tool that would address all of these issues like some silver bullet. In particular, problem #3 would be more about how we shaped and documented our internal processes than about finding some new technology to fix it for us. Some companies approach the technical challenges with a master/slave setup, where they run their migrations on the slave before doing a promote/swap routine behind a maintenance window. 

Although that may have been an option for us, AWS makes it a little trickier to accomplish with their RDS service, so we chose instead to look at two solutions for running online, concurrent migrations: Percona’s pt-online-schema-change and MySQL 5.6’s native Online DDL.

A Tale of Two Migration Tools

Historically, MySQL ran DDL commands (i.e. ALTER TABLE…) by locking and copying the table in question. A big step forward came with MySQL 5.1 and the InnoDB plugin, which at least let you create indexes quickly without a full table copy. However, it wasn’t until MySQL 5.6 was released that full Online DDL support became available for standard operations like adding and dropping columns, changing foreign key constraints, and changing auto-increment values. 
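As a sketch of what that looks like in practice (table and column names are hypothetical): MySQL 5.6 lets you request the online behavior explicitly, so the statement fails fast instead of silently falling back to a copy-and-lock if the operation can’t be done in place:

```sql
-- Ask MySQL 5.6 to perform the change in place with no locking.
-- If this ALGORITHM/LOCK combination isn't supported for the
-- operation, the statement errors out instead of blocking writes.
ALTER TABLE signatures
  ADD COLUMN signed_at DATETIME NULL,
  ALGORITHM=INPLACE, LOCK=NONE;
```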

At the time of our initial investigation we were still running MySQL 5.5, which led us to look at an alternate approach. Percona’s pt-online-schema-change seemed to be the tool of choice: it runs non-blocking DDL statements by copying a table over in chunks and setting triggers that capture concurrent writes and re-apply them to the copy. 

After looking at both tools, we found MySQL 5.6’s Online DDL tended to be the better option in most cases. The reason is that Online DDL alters a table in place whenever possible, while pt-online-schema-change always performs a full table copy. While both operations are non-locking, Percona’s approach typically leads to a slower overall operation. 

The only exceptions we’ve found to this rule seem to be:

  1. If you have a read replica (slave) in use, Percona may be a better choice, because it throttles itself to adjust for replication lag.
  2. If you are running an operation Online DDL doesn’t support concurrently (like changing data types or storage engines).
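For the replica-lag case above, a pt-online-schema-change invocation might look like the following sketch (host names, credentials, schema, and column names are placeholders):

```bash
# Copy the table in 1000-row chunks; pause copying whenever the
# replica named in --check-slave-lag falls more than 1 second behind.
pt-online-schema-change \
  --alter "ADD COLUMN signed_at DATETIME NULL" \
  --chunk-size 1000 \
  --max-lag 1 \
  --check-slave-lag h=replica1.example.com \
  --execute \
  h=primary.example.com,D=myapp,t=signatures
```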

Finally, as always, maintain good backups and practice your migrations in a sandbox environment under load (I recommend a battery of performance and integrity checks with concurrent migrations running, using a load-testing tool like multi-mechanize). 

If you use Percona, you have two choices for updating tables that are referenced by children. The --alter-foreign-keys-method option accepts either ‘rebuild_constraints’ or ‘drop_swap’ as arguments. ‘rebuild_constraints’ is the preferred method, but if the child table is so big that the keys can’t be rebuilt in a time comparable to the table copy, it may not be a feasible choice. ‘drop_swap’, on the other hand, is an irreversible operation, so use it with caution.
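Concretely, the option is passed like this (again with placeholder names). ‘rebuild_constraints’ re-ALTERs each child table so its foreign keys point at the new parent, while ‘drop_swap’ disables foreign key checks, drops the original table, and renames the copy into place:

```bash
# Preferred: rebuild the foreign keys on each child table, accepting
# that large children make this step slow.
pt-online-schema-change \
  --alter "ADD COLUMN signed_at DATETIME NULL" \
  --alter-foreign-keys-method rebuild_constraints \
  --execute \
  h=primary.example.com,D=myapp,t=signatures
```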

Get Your Devs in a Row

After we updated our Amazon RDS instance to MySQL 5.6 (which could be its own blog post, let me tell you), we turned our attention to the last challenge: organizing the migration use cases and laying out the processes necessary to make them work online. In the end, we designed a two-to-three-step process that addressed the vast majority of our scenarios. 

Step 1: Pre-code-release 

Before releasing any code that uses new columns, indexes, or relationships, run these kinds of migrations first (remember, with MySQL 5.6 none of these require downtime):

  • Add indexes
  • Add columns
  • Add new tables
  • Add foreign keys
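A pre-release migration covering the items above might look like this (all names are illustrative). Each change is purely additive, so the old code keeps working while they run online under MySQL 5.6:

```sql
-- Additive changes only: safe to run while the current code is live.
ALTER TABLE documents
  ADD COLUMN archived_at DATETIME NULL,
  ALGORITHM=INPLACE, LOCK=NONE;

ALTER TABLE documents
  ADD INDEX idx_documents_archived_at (archived_at),
  ALGORITHM=INPLACE, LOCK=NONE;

-- New tables (and their foreign keys) don't touch existing data at all.
CREATE TABLE document_events (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  document_id BIGINT UNSIGNED NOT NULL,
  created_at DATETIME NOT NULL,
  CONSTRAINT fk_document_events_document
    FOREIGN KEY (document_id) REFERENCES documents (id)
) ENGINE=InnoDB;
```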

Step 2: Release the code! 

Deploy the new code, which begins to use the schema items you rolled out in step 1. At the same time, you can remove dependencies on old schema components (although you may need some temporary switching logic in place if data has changed form or location). For example, your code might attempt to read from the new location and, if the data isn’t there yet, fall back to reading from the old location. All new writes, however, go to the new location, while a background process systematically moves the remaining data over. 

It’s a classic backwards-compatible model that cleans up after itself. To achieve it, make use of your ORM’s hooks or filters, such as pre-save, pre-hydrate, and the like. 
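The read-fallback pattern from step 2 can be sketched in a few lines of Python. DualStore and the dict-backed stores here are illustrative stand-ins for your ORM models and tables, not our actual code:

```python
class DualStore:
    """Backwards-compatible accessor used during an online migration.

    Reads prefer the new location and fall back to the old one;
    all writes go only to the new location.
    """

    def __init__(self, old, new):
        self.old = old  # legacy table/column contents (stand-in dict)
        self.new = new  # migrated table/column contents (stand-in dict)

    def read(self, key):
        # Prefer the new schema; fall back to the old one if this
        # row hasn't been backfilled yet.
        if key in self.new:
            return self.new[key]
        return self.old.get(key)

    def write(self, key, value):
        # All new writes land in the new location only.
        self.new[key] = value

    def backfill(self):
        # Background worker: copy over any rows still only in the
        # old store, never overwriting fresher writes in the new one.
        for key, value in self.old.items():
            self.new.setdefault(key, value)
```

In a real app the read/write indirection would live in your ORM’s pre-save and pre-hydrate hooks; once the backfill finishes, the fallback branch and the old location can both be deleted.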

Step 3 (optional, depending on the scenario): Tear down and clean up 

After the new code is in place, you can run the post-release migrations:

  • Remove old columns
  • Remove old indexes
  • Drop old tables
  • Remove old foreign keys

Renaming a column can be split into adding the new column in a pre-release migration, followed by dropping the old column in a post-release migration. 
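For example, renaming a hypothetical users.email column to email_address would break down like this:

```sql
-- Pre-release: add the new column alongside the old one.
ALTER TABLE users
  ADD COLUMN email_address VARCHAR(255) NULL,
  ALGORITHM=INPLACE, LOCK=NONE;

-- (Deploy code that writes to email_address, reads it with a
--  fallback to email, and backfills the remaining rows.)

-- Post-release: drop the old column once nothing references it.
ALTER TABLE users
  DROP COLUMN email,
  ALGORITHM=INPLACE, LOCK=NONE;
```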

Keep in mind your migrations will not always require all three steps. If all you’re doing is adding new functionality and data tracking, you will likely only need steps 1 and 2.

Any time we have an upcoming migration we can now reference these guidelines to determine the best way to proceed. 

With this research and specification work under our belts we can continue to provide not only the best eSignature app and API around, but further improve our reliability and availability.
