Gotham Online Migration Framework

Published in Palantir Blog · 12 min read · Sep 23, 2021

There’s one concern that unites software engineers across the globe, regardless of project, industry or tenure: how do I best deploy and manage updates to my software? From new features to bug fixes to performance improvements, we want to make sure our products are running optimally. This question is particularly pertinent at Palantir, where we deploy software to some of the most remote locales on Earth to solve some of the world’s hardest problems.

Apollo, the platform that powers both Foundry and Gotham, manages the complexity of all our environments and delivers product updates to our customers safely, consistently, and live, without disrupting users on the system. Our Online Migration Framework complements Apollo by ensuring that services can safely manage schema and data migrations through Apollo’s upgrades.

In the days before I wrote this post, I wrote, tested, and shipped a data migration using the Online Migration Framework. It only took me about two hours. Without the framework, it likely would have taken days and required much more careful thought and planning — and possibly caused bugs in the field, as even a fairly simple data migration is still a complex problem to solve without the framework.

In this blog post, I’ll talk about:

  • Why Palantir uses online upgrades over offline ones
  • The challenges of and common strategies for online upgrades
  • How Palantir uses the Online Migration Framework to streamline online upgrades across data scale, deployment and execution scale, and development scale

Why Palantir Uses Online vs. Offline Upgrades

There are two common strategies in the industry for changing the structure of stored data: offline upgrades and online upgrades.

You’re probably familiar with offline upgrades. The device you’re reading this on likely upgrades this way. When a new version of the device’s software becomes available, the device prompts you to save your work and shut down to perform upgrades. Once the upgrade is complete, you start your device back up again and are using the latest and greatest software version.

From a software development standpoint, offline upgrades are fantastic. They are relatively simple to implement and easy to understand. For this reason, offline upgrades are often the preferred strategy if you have the luxury of doing so.

Palantir does not have this luxury. Our software supports 24/7 mission-critical work around the world, often in life-and-death situations. The system must be available at all times, so we need to perform online data migrations.

The Challenge of Online Upgrades

Online data migrations are much trickier than offline ones. If your software needs to upgrade live, you need the new version to be up and available before taking down the old version (Apollo supports rolling blue-green upgrades). The old code wants to use data in the old format, and the new code wants to use data in the new format.

For a less technical example, imagine trying to raise the height of a table. An offline migration would be reasonably straightforward. Take the legs off, replace them with longer legs, and stand the table back up again. But what if the table needs to be used while you’re doing this? You’d need to carefully replace the legs one at a time. But how exactly will you fit the new longer legs under the table while the shorter legs are still supporting it?

Of course, Palantir is not the only company whose software has high uptime demands; many companies need to perform online data migrations on large-scale data. The standard strategy for doing so is nicely described in the Stripe tech blog. We used a very similar strategy when upgrading Gotham to use AtlasDB as the transactional layer for its data. In summary, the strategy is as follows (a small sketch of the pattern follows the list):

  • Create the destination for the data to be migrated (e.g., a new column or table in a relational database).
  • Migrate the data from the old format to the new format. While the migration is in progress, code should read from the old format and write to both the old and new formats.
  • Once the migration is done, code can read and write the new format only, and the old format can be deleted (e.g., drop no-longer-used columns or tables).
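
In code, the dual-write pattern from the list above might look roughly like the following sketch. The UserTable interface, the DualWriteUserStore class, and the migrationComplete flag are all illustrative stand-ins rather than real APIs: reads stay on the old format, which remains the source of truth until the backfill completes, while writes go to both formats.

// Illustrative sketch of the dual-write strategy; all types here are hypothetical.
interface UserTable {
    String readFullName(String userId);
    void writeFullName(String userId, String fullName);
}

final class DualWriteUserStore {
    private final UserTable oldFormat;          // existing schema
    private final UserTable newFormat;          // destination schema created in step one
    private volatile boolean migrationComplete; // flipped once the backfill has finished

    DualWriteUserStore(UserTable oldFormat, UserTable newFormat) {
        this.oldFormat = oldFormat;
        this.newFormat = newFormat;
    }

    String readFullName(String userId) {
        // The old format remains the source of truth until the backfill completes.
        return migrationComplete ? newFormat.readFullName(userId) : oldFormat.readFullName(userId);
    }

    void writeFullName(String userId, String fullName) {
        // Writes go to both formats so neither falls behind while the backfill runs.
        oldFormat.writeFullName(userId, fullName);
        newFormat.writeFullName(userId, fullName);
    }
}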

This approach is pretty straightforward at a high level. However, like the analogous solution in the table metaphor (i.e., building extendable table legs that you extend only after replacing each leg one by one), getting the details right poses a much more significant technical challenge than offline migrations. Engineers can spend weeks or even months planning, developing, and carefully executing such a migration for their production stack. If you have one or just a few production stacks where this migration needs to run and your engineers have access to them, this is usually a manageable project.

Once again, Palantir does not have this luxury. Our software is deployed to hundreds of environments, including on and under the water, in the air, and in remote outposts far from home. Even if we could somehow manage to upgrade these stacks using the standard strategy, the approach would not scale. We need to automate the execution of our migrations.

Furthermore, Palantir moves quickly. We’re constantly improving our software and developing new capabilities, necessitating frequent data migrations. Some of our fastest-moving projects require significant data migrations almost monthly. Spending a month every month developing a data migration is devastating to forward progress.

Benefits of the Online Migration Framework

The Online Migration Framework is our solution to the challenge of live data migration. This framework is effective because it scales across data, deployment and execution, and development.

To help make this framework concrete, I’ll be using a running example of a classic migration: changing how a user’s name is stored, from a single column holding “Last, First” to two separate last_name and first_name columns, as shown below. I’ll be writing in pseudo-code to give an idea of what developing this migration might look like.

Writing Migrations

The first thing the Online Migration Framework provides is a way for developers to write migrations. This is broken into three parts that should look familiar:

  • Pre-migration schema additions: Developers create tables, columns, and any other schema elements necessary to support the new format of the data.
CREATE TABLE Users_2 (
    user_id VARCHAR2(128),
    last_name VARCHAR2(128),
    first_name VARCHAR2(128),
    CONSTRAINT pk_users_2 PRIMARY KEY (user_id)
);
  • Migration batches: Developers write code to migrate data from the old format to the new format in batches. The framework also provides a data store for migrations to store their progress during this phase, since it’s possible at any point for the process running the migration to fail. We don’t want to have to start a long migration from scratch.
...
// Read the next batch of users in a stable order, starting after the last ID we processed.
List<Users2Record> userBatch = sql.select(Users.USER_ID, Users.FULL_NAME)
        .from(Users)
        .where(Users.USER_ID.greaterThan(getLastProcessedId()))
        .orderBy(Users.USER_ID.asc())
        .limit(batchSize)
        .fetch(row -> createUsers2Record(row.get(Users.USER_ID), row.get(Users.FULL_NAME)));
if (!userBatch.isEmpty()) {
    // Copy the batch into the new table; re-inserting an already-migrated row is a no-op.
    List<Query> inserts = userBatch.stream()
            .map(record -> sql.insertInto(Users_2).set(record).onDuplicateKeyIgnore())
            .collect(Collectors.toList());
    sql.batch(inserts).execute();
    // Persist progress so a restarted process can resume where it left off.
    updateLastProcessedId(userBatch.get(userBatch.size() - 1).getUserId());
} else {
    markMigrationCompleted();
}
...
  • Post-migration schema drops: Developers drop any no-longer-needed schema elements and the data they contain.
DROP TABLE Users;

Because, as mentioned, the process running these migrations could fail at any time, each of these sections must be idempotent: running it multiple times must be safe and have the same effect as running it once. For the DDL statements, it usually suffices to catch any exception returned by the data store and check whether it is the expected type (for example, that the table already exists).
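
As a rough illustration of that idempotence pattern for the example’s pre-migration step, the sketch below creates the new table and swallows only the “already exists” failure. It assumes a jOOQ-style context like the pseudo-code above; the isTableAlreadyExists helper is a hypothetical stand-in for a vendor-specific error-code check.

import org.jooq.DSLContext;
import org.jooq.exception.DataAccessException;

final class CreateUsers2Step {
    // Idempotent pre-migration schema addition (illustrative only).
    static void createUsers2TableIdempotently(DSLContext sql) {
        try {
            sql.execute("CREATE TABLE Users_2 ("
                    + "user_id VARCHAR2(128), "
                    + "last_name VARCHAR2(128), "
                    + "first_name VARCHAR2(128), "
                    + "CONSTRAINT pk_users_2 PRIMARY KEY (user_id))");
        } catch (DataAccessException e) {
            if (!isTableAlreadyExists(e)) {
                throw e; // anything other than "table already exists" is a real failure
            }
            // A previous attempt already created the table; safe to continue.
        }
    }

    private static boolean isTableAlreadyExists(DataAccessException e) {
        // Stand-in check; real code would inspect the vendor-specific error code.
        return e.getMessage() != null && e.getMessage().contains("already exists");
    }
}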

Supporting Rollback

These migrations also need to support rollback. The fastest way to mitigate an issue is to return to the last known working state, which often means downgrading the software to an earlier version. But what if a migration was in progress? The old version of the code may not know how to handle this state, so we need to roll the migration back. Because the only reason to roll back a migration is to mitigate an issue, we require rollback to be fast. In practice, rollback only has to undo the pre-migration schema additions from step one above, so it should be roughly as fast as dropping the new columns or tables that step created.
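
For the running example, a rollback step in this spirit could be as simple as the sketch below (again illustrative, with isTableDoesNotExist mirroring the hypothetical helper above): it drops the new table, quickly and idempotently.

import org.jooq.DSLContext;
import org.jooq.exception.DataAccessException;

final class RollBackUsers2Step {
    // Illustrative rollback: undo the pre-migration schema additions.
    static void dropUsers2TableIfPresent(DSLContext sql) {
        try {
            sql.execute("DROP TABLE Users_2");
        } catch (DataAccessException e) {
            if (!isTableDoesNotExist(e)) {
                throw e;
            }
            // The table was already dropped (or never created); nothing left to do.
        }
    }

    private static boolean isTableDoesNotExist(DataAccessException e) {
        // Stand-in check; real code would inspect the vendor-specific error code.
        return e.getMessage() != null && e.getMessage().contains("does not exist");
    }
}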

Updating Code to Handle In-Progress Migrations

Developers also need to update the code that accesses this data in order for the code to work correctly before, during, and after the migration. To support this, we’ve built a Migration Proxy. Developers can put their data access code behind an interface and annotate the interface’s methods with @Read or @Write. They can then implement this interface once for the new state of the data and once for the old state. Along with the Write annotation, developers can also specify what the behavior should be while the migration is in progress:

  • Writing to both implementations (default).
  • Writing only to the new implementation (for example, when adding a column to a table, where the new write is a superset of the old write).
  • Writing only to the old implementation (for example, when the migration just drops a column, where the old write is a superset of the new write).
  • (Rare) Writing to a “middle” implementation that provides custom handling while a migration is in progress.

In the example:

public interface UserStore {
    @Read
    User loadUser(String userId);

    @Write
    void storeUser(User user);
}

This sounds like it could be a lot of work, but we’ve developed a process where developers keep a primary implementation up to date with the latest data format and subclass this implementation, overriding only the methods that need to be overridden for each migration going backward. As this suggests, this process can be repeated for multiple migrations affecting the same data, and the Migration Proxy will accept a list of implementations and the migrations that go between them. From this list, it will choose the correct implementation(s) to use based on the state of each migration. In our example, a Migration Proxy for the UserStore would be created like so:

UserStore proxiedUserStore = SimpleMigrationAwareProxy.wrap(
        UserStore.class,
        new LegacyUserStoreImpl(dbSession),
        new UserStoreImpl(dbSession),
        USER_STORE_MIGRATION_TASK_ID,
        dbSession);
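
Conceptually, the proxy produced by that call routes each annotated method based on the migration’s current state. The sketch below shows what that dispatch could look like for the default write behavior; the MigrationState values, MigrationStatusStore, and the hand-written ProxiedUserStore are simplifications for illustration, not the framework’s actual internals (the real proxy is generated from the interface, and User is the example’s data type).

// Conceptual sketch of Migration Proxy dispatch for the default write behavior.
enum MigrationState { UNINITIALIZED, RUNNING, AWAITING_FINALIZATION, FINISHED }

interface MigrationStatusStore {
    MigrationState stateOf(String migrationId);
}

final class ProxiedUserStore implements UserStore {
    private final UserStore legacyImpl;  // reads/writes the old data format
    private final UserStore currentImpl; // reads/writes the new data format
    private final MigrationStatusStore statusStore;
    private final String migrationId;

    ProxiedUserStore(UserStore legacyImpl, UserStore currentImpl,
            MigrationStatusStore statusStore, String migrationId) {
        this.legacyImpl = legacyImpl;
        this.currentImpl = currentImpl;
        this.statusStore = statusStore;
        this.migrationId = migrationId;
    }

    @Override
    public User loadUser(String userId) {
        // Reads come from the old format until the migrated data is ready, then from the new format.
        return switch (statusStore.stateOf(migrationId)) {
            case UNINITIALIZED, RUNNING -> legacyImpl.loadUser(userId);
            case AWAITING_FINALIZATION, FINISHED -> currentImpl.loadUser(userId);
        };
    }

    @Override
    public void storeUser(User user) {
        // Default @Write behavior: write both formats while rollback is still possible.
        MigrationState state = statusStore.stateOf(migrationId);
        if (state != MigrationState.FINISHED) {
            legacyImpl.storeUser(user);
        }
        if (state != MigrationState.UNINITIALIZED) {
            currentImpl.storeUser(user);
        }
    }
}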

Coordinating Migrations through the Migration Runner

Each of these migrations gets a monotonically increasing ID at commit time (e.g., USER_STORE_MIGRATION_TASK_ID) and will be run by our Migration Runner in order. Because migrations run in the same order in every environment, the number of possible data store states stays manageable, and we can look at the migration state of any service and immediately know the state of its data store without inspecting it.

As it runs migrations, the Runner coordinates with Migration Proxies by grabbing a write lock on a migration’s status whenever it changes that status. Meanwhile, the Proxy grabs a read lock on the status for the duration of the operation it’s performing. Critically, this allows large-scale migrations to run safely with minimal impact on the rest of the code, because multiple threads can hold the read lock while doing data access operations. The Runner only needs to hold the write lock for the brief moments when it changes the migration status in the data store, not during any actual migration execution.
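
The lock discipline is the same one you would get from a standard read-write lock, as in the single-process sketch below. This is only an analogy for the locking pattern; the framework’s real coordination spans the service’s data access paths rather than one in-memory lock.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

final class MigrationStatusLock {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Proxies: many threads may hold the read lock concurrently while performing data access.
    <T> T withStatusPinned(Supplier<T> dataAccess) {
        lock.readLock().lock();
        try {
            return dataAccess.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Runner: the write lock is held only for the brief status transition itself,
    // never for the duration of the migration batches.
    void transitionStatus(Runnable statusUpdate) {
        lock.writeLock().lock();
        try {
            statusUpdate.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}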

When the Runner starts up, it executes a consensus protocol with the other nodes of the service to determine what migrations are known by all nodes. This is important because as we continually deliver new software in ever-shorter development cycles, we will more and more frequently be in situations where a service may have multiple nodes with heterogeneous versions of software deployed. The Runner needs to determine what migrations are supported by all nodes so it can continue to make progress through the series of migrations even when a service is receiving frequent rolling upgrades via Apollo.
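
The outcome of that protocol can be summarized as a single number: the highest migration ID that every node’s binary knows about, which is as far as the Runner may safely advance. A minimal sketch of that calculation follows; the representation of the nodes’ reports is made up for illustration.

import java.util.Collection;

final class KnownMigrations {
    // Each node reports the highest migration ID shipped in its binary; the fleet-wide
    // ceiling is the minimum of those reports.
    static int highestMigrationKnownByAllNodes(Collection<Integer> highestKnownPerNode) {
        return highestKnownPerNode.stream()
                .mapToInt(Integer::intValue)
                .min()
                .orElseThrow(() -> new IllegalStateException("no nodes reported"));
    }
}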

Employing Soak Time and Migrating in Stages

To expedite rollbacks, we employ a configurable “soak time” (four days by default) after a migration becomes available before we start running it. The Migration Runner records when each individual migration first becomes available; after soaking, the Runner begins its work. Running a migration proceeds through several stages, more numerous than the three parts mentioned above, so that the Runner can fail at any point in execution and recover:

  • Uninitialized: The migration has not yet begun, and all Proxies read and write the old data format.
  • Rolling Back: The migration is in the process of rolling back to the Uninitialized state but may not have finished doing so.
  • Initializing: The Runner is executing pre-migration schema additions but may not have completed them. Rollback will need to run if we need to downgrade. Proxies still read and write the old format.
  • Running: Pre-migration schema additions are complete, and the migration has (maybe) started running its batches. Proxies read the old format and write both formats (or just old or new based on developer-specified behavior).
  • Awaiting Additional Action (optional): The migration has completed, but there is action outside of the framework’s control that needs to be taken. For example, new data may need to be indexed into Elasticsearch or some other secondary index before the migration can be considered fully complete. Proxies now begin reading the new format but still write to both to support rollback.
  • Awaiting Finalization: The migration is complete. Proxies read the new format but still write to both to support rollback.
  • Finishing: The Runner is executing post-migration schema drops but may not have finished. The migration can no longer be rolled back. Proxies read and write only the new format.
  • Finished: The migration is complete and all old schemas and data have been dropped.

Developers must write idempotent migration behavior, so if the Runner fails at any point, we can restart that stage and keep moving along without issue. We also employ an additional soak time between Awaiting Finalization and Finishing. This allows us to identify issues that only become apparent once the new data format is available and address those issues before we are no longer able to roll back to the original state.
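
Putting the stages and the idempotency requirement together, a runner’s control loop could look roughly like the sketch below: it persists the current stage and simply re-enters that stage after a crash. The Stage values, MigrationStore, and Migration interfaces are illustrative simplifications; rollback, the optional Awaiting Additional Action stage, and the soak times are omitted.

// Illustrative control loop: every stage is idempotent, so after a crash the runner just
// re-enters whatever stage was last persisted.
enum Stage { UNINITIALIZED, INITIALIZING, RUNNING, AWAITING_FINALIZATION, FINISHING, FINISHED }

final class StageRunner {

    interface MigrationStore {
        Stage currentStage(String migrationId);
        void setStage(String migrationId, Stage next);
    }

    interface Migration {
        void applySchemaAdditions(); // idempotent pre-migration schema additions
        boolean runNextBatch();      // returns true once all batches have been migrated
        void dropOldSchema();        // idempotent post-migration schema drops
    }

    private final MigrationStore store;

    StageRunner(MigrationStore store) {
        this.store = store;
    }

    void runToCompletion(String migrationId, Migration migration) {
        while (store.currentStage(migrationId) != Stage.FINISHED) {
            switch (store.currentStage(migrationId)) {
                case UNINITIALIZED -> store.setStage(migrationId, Stage.INITIALIZING);
                case INITIALIZING -> {
                    migration.applySchemaAdditions();
                    store.setStage(migrationId, Stage.RUNNING);
                }
                case RUNNING -> {
                    if (migration.runNextBatch()) {
                        store.setStage(migrationId, Stage.AWAITING_FINALIZATION);
                    }
                }
                case AWAITING_FINALIZATION -> store.setStage(migrationId, Stage.FINISHING);
                case FINISHING -> {
                    migration.dropOldSchema();
                    store.setStage(migrationId, Stage.FINISHED);
                }
                case FINISHED -> { }
            }
        }
    }
}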

Managing the Migration Lifecycle

After a migration has been completed, we generally want to remove the migration code. But when exactly is a migration complete? It has to run on potentially hundreds of separate environments. Some migrations take less than a second, while others can take weeks.

To have a clear point at which migration code can be removed, we designate specific releases as “checkpoints.” To install upgrades beyond a checkpoint, a service must first complete all the migrations up to that checkpoint. Apollo tracks (1) which schema versions each release supports and (2) the current schema version of the live service, based on which migrations have run. For releases beyond a checkpoint, we raise the minimum supported schema version to that of the checkpoint’s last migration so that Apollo can enforce this requirement. Apollo can also require that migrations be rolled back before downgrades, since the old binary will report a maximum schema version only as high as the migrations it knows about.
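
In other words, deployability reduces to a range check, roughly like the sketch below. The field names and the idea of a single integer schema version are simplifications of how Apollo actually models this.

final class SchemaVersionGate {
    // Illustrative compatibility rule: a release may be deployed only if the live schema
    // version falls within the schema version range that release declares it supports.
    static boolean canDeploy(int liveSchemaVersion, int releaseMinSchemaVersion, int releaseMaxSchemaVersion) {
        // Past a checkpoint, releaseMinSchemaVersion is raised to the checkpoint's last migration,
        // so upgrades are blocked until those migrations have run; downgrades are blocked until
        // migrations newer than releaseMaxSchemaVersion have been rolled back.
        return liveSchemaVersion >= releaseMinSchemaVersion
                && liveSchemaVersion <= releaseMaxSchemaVersion;
    }
}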

This post-checkpoint cleanup is often trivial because of the multiple implementations mentioned above, which support both old and new data formats. Since we already have a complete implementation of the data access code for all completed migrations, we can delete the old subclasses that supported old versions and inline the latest implementation rather than keeping it wrapped in a Proxy. This allows us to scale our development efforts while keeping our codebase neat and tidy; developers need only minutes to clean up after a checkpoint.

Testing Migrations Across the Platform

We’ve also built a comprehensive testing framework for our online migrations. Migration developers can leverage a simple JUnit test extension that provides a custom Migration Runner for the migration they are testing and allows easy roll-forward and roll-back testing. We use this same extension framework for more comprehensive testing across the entire product. When a new migration is committed, our build system generates tests for every data store we support and for the four key migration states (Uninitialized, Running, Awaiting Finalization, and Finished), then runs our test suite across all of those combinations to ensure our products work throughout every phase of the migration.
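
A test in this style has roughly the shape sketched below. The state enum, the fixture method, and the User accessors are hypothetical stand-ins; in the real framework, the JUnit extension provisions the data store and pins the migration to the requested state.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.EnumSource;

class UserStoreMigrationTest {

    enum TestedMigrationState { UNINITIALIZED, RUNNING, AWAITING_FINALIZATION, FINISHED }

    @ParameterizedTest
    @EnumSource(TestedMigrationState.class)
    void userRoundTripsInEveryMigrationState(TestedMigrationState state) {
        UserStore store = userStoreWithMigrationState(state); // hypothetical fixture
        store.storeUser(new User("id-1", "Michael", "Harris"));
        // Callers should see identical behavior regardless of the migration's state.
        assertEquals("Michael", store.loadUser("id-1").getFirstName());
    }

    private UserStore userStoreWithMigrationState(TestedMigrationState state) {
        // Stand-in for the extension's state control; omitted to keep the sketch compact.
        throw new UnsupportedOperationException("fixture omitted from sketch");
    }
}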

Finally, the Online Migration Framework is service- and datastore-agnostic and has been implemented across a number of our services and several different data stores. We’ve developed a suite of abstract test classes that can be implemented with a service’s implementation of the framework to quickly test that the implementation is working properly. This allows us to quickly scale our ability to safely and sustainably migrate data across our product ecosystem.

The Online Migration Framework at Scale

The Online Migration Framework has greatly improved our ability to continually deliver improvements and new features to our customers. Over the last couple of years, dozens of developers have written several hundred migrations, and we’ve run approximately 15,000 online data migrations across the fleet without disruption to users. The framework has even been extended to uses beyond data migration, and several common migration types have been built into it, so developers may need to write only ten lines of code to complete a migration, allowing them to focus on delivering the best technology to our customers.

Author

Michael Harris, Software Engineer, Gotham Delivery

Interested in working on technical challenges like scalable online data migration? We’re hiring. Reach out to us at palantir.com/careers.
