The Benefits of Running Kubernetes on Ephemeral Compute


This is the third post in our blog series on Rubix (#1, #2), our effort to rebuild our cloud architecture around Kubernetes.

Introduction

The advent of containers and their orchestration platforms, most popularly Kubernetes (K8s), has familiarized engineers with the concept of ephemeral compute: once a workload has completed, the resources used to run it can be destroyed or recycled. Ephemeral compute constructs, such as K8s pods, make it possible to:

  • Run immutable workload configurations derived from a common template → this immutability simplifies the debugging of misbehaving workloads
  • Implement upgrades as replacing workloads → this avoids the complexities of implementing in-place updates

The concept of ephemeral compute also positively impacts engineering discipline. Because pods might get terminated at any time, engineers must scrutinize an application’s resiliency at all stages of development.

The Kubernetes pod is often the only level at which engineers interact with the concept of ephemeral compute. When first designing Rubix, Palantir’s Kubernetes-based infrastructure platform, we asked ourselves: what value can we get from applying the concept of ephemeral compute further down the stack?

This post covers Rubix’s commitment to managing K8s clusters composed of immutable, short-lived K8s nodes, or “compute instances.” We discuss our conviction in this approach, its attendant challenges, and how it has made production Rubix environments more resilient and easier to operate.

Why Ephemeral Compute Instances?

Managing a Kubernetes cluster composed of short-lived nodes sounds hard. So, why bother? We can gain conviction in the value of this endeavor by recounting the benefits of a more familiar ephemeral compute construct: the Kubernetes pod. These benefits boil down to a pod’s immutability:

  • The pod spec suffices for understanding a pod’s workload
  • The problem of upgrading “in place” goes away — upgrades happen via replacement instead

We can realize similar benefits by treating K8s nodes as ephemeral compute instances. Immutable nodes derived from instance templates simplify the debugging of any node issues encountered in production. The ability to destroy and replace nodes removes the need for upgrading instances in place.

In addition, we imposed the constraint that nodes in Rubix environments cannot live longer than 48 hours. This constraint has several benefits:

  • Reduced need for manual actions from cluster admins: Problematic or outdated instances get replaced automatically, and the logic that drives this replacement contains encoded learnings from support issues.
  • Improved security posture: Compromising a single node is insufficient for an attacker to gain persistent access to an environment.
  • Enforcing engineering discipline: Pod disruptions are no longer just possible; they’re guaranteed. Engineers must treat high availability as critical when developing applications.
  • Feature pathing: Routinely launching and terminating instances without disrupting production environments is a building block of cluster autoscaling. Engineering efforts succeed when they’re decomposed into narrowly scoped problems.

Having convinced ourselves that running K8s clusters composed of ephemeral nodes has such upside, let’s discuss the problems we had to solve to achieve this goal.

Gracefully Draining and Terminating Nodes

Terminating instances every two days makes it challenging for applications running in Rubix to maintain availability. This requires developers to write code with high availability in mind, but our infrastructure must still ensure that instance terminations don’t completely destabilize hosted services.

We solved this by developing a termination pipeline with three components:

  • Policy-driven node selection, which considers criteria such as node age and health
  • Recording node selection decisions via node labels and annotations
  • Termination logic, which handles the draining and final termination of the K8s node and its associated cloud instance

This decomposition provides both operational and stability benefits.

Operationally, our node selection policy abstraction allows us to encode learnings from cluster operations. Upon identifying categories of bad node states in production, we define policies to automate the termination and replacement of such nodes. For example, the old AWS EBS volume plugin (since deprecated in favor of the EBS CSI driver) applied NoSchedule taints to a node if it detected the node had volumes stuck in an attaching state. The Rubix team added a termination policy selecting nodes with such taints, removing the need for manual intervention in future instances of the same issue.
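To make the policy abstraction concrete, below is a minimal sketch of what policy-driven node selection could look like. This is illustrative rather than Rubix's actual implementation: the NodeSelectionPolicy interface and the taint key are hypothetical, and the sketch assumes only the standard k8s.io/api/core/v1 node types.

```go
package termination

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// NodeSelectionPolicy is a hypothetical abstraction: given a node, decide
// whether the termination pipeline should select it for replacement.
type NodeSelectionPolicy interface {
	ShouldTerminate(node *corev1.Node) bool
}

// maxAgePolicy selects nodes older than a maximum lifetime (48 hours in Rubix).
type maxAgePolicy struct {
	maxAge time.Duration
}

func (p maxAgePolicy) ShouldTerminate(node *corev1.Node) bool {
	return time.Since(node.CreationTimestamp.Time) > p.maxAge
}

// taintPolicy selects nodes carrying a taint that signals a known-bad state,
// such as a NoSchedule taint applied when volumes are stuck attaching.
// The taint key is supplied by the operator; no real plugin's key is assumed here.
type taintPolicy struct {
	key    string
	effect corev1.TaintEffect
}

func (p taintPolicy) ShouldTerminate(node *corev1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == p.key && taint.Effect == p.effect {
			return true
		}
	}
	return false
}

// Policies compose: a node is selected for termination if any policy matches,
// which is what lets each new production learning become one more small policy.
func selectForTermination(node *corev1.Node, policies []NodeSelectionPolicy) bool {
	for _, policy := range policies {
		if policy.ShouldTerminate(node) {
			return true
		}
	}
	return false
}
```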

From a stability perspective, our termination logic accounts for application availability through its use of eviction APIs, respecting any relevant PodDisruptionBudgets (PDBs) during node draining.
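A condensed sketch of that draining step, assuming recent versions of client-go (which expose the eviction API through EvictV1) and eliding the retry handling a real drain needs when a PDB temporarily blocks an eviction:

```go
import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainNode cordons a node and evicts its pods through the eviction API,
// so the API server can reject any eviction that would violate a PDB.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon: mark the node unschedulable so no new pods land on it.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Request an eviction for every pod bound to the node.
	// (A real drain also skips DaemonSet and mirror pods.)
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		// In practice, a 429 response here means a PDB is currently blocking
		// the eviction, and the drain should back off and retry.
		if err := client.CoreV1().Pods(pod.Namespace).EvictV1(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```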

Readers are likely already familiar with PDBs and evictions; the novelty of our termination pipeline solution is the encoding of policies to relieve environment operators of the burden of manually identifying and replacing nonfunctional cloud instances.

Upgrading Instances

Before the arrival of container orchestration platforms, application upgrades were painful and error-prone. Pitfalls included:

  • Relying on aspects of the underlying machine state (for example, the JDK versions available on the machine) that could vary
  • Invoking the new binary from the wrong directory or otherwise subtly changing its invocation
  • Inheriting unwanted or obscure state from the previous process’s working directory

Engineers who have experienced these upgrades can appreciate the simplicity of a K8s Deployment rolling update. Old pods are simply destroyed and replaced with new ones, and the pod template is the sole source of configuration for each replica. Rubix’s commitment to ephemeral compute brings these upgrade semantics to the K8s node level.

Consider a routine OS upgrade. This results in a new template for our cloud instances (whether an AWS launch template or a GCP instance template). From there, expediting upgrades simply involves adding a new termination policy that prioritizes instances derived from older templates. The cloud provider then uses the new template when replacing these terminated instances.
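Reusing the hypothetical policy interface from the earlier sketch, such a policy might compare the template version recorded on each node against the currently desired one. The label key below is illustrative; in practice the template identity could equally come from a node annotation or cloud-provider metadata.

```go
// templatePolicy selects nodes launched from anything other than the desired
// template version, so that they are drained and replaced first.
type templatePolicy struct {
	labelKey       string // hypothetical, e.g. "example.com/launch-template-version"
	desiredVersion string
}

func (p templatePolicy) ShouldTerminate(node *corev1.Node) bool {
	return node.Labels[p.labelKey] != p.desiredVersion
}
```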

  1. An instance group has instances running in a steady state.
  2. The instance group receives a new configuration.
  3. The instance group creates new instances using the new configuration.
  4. The termination pipeline marks the old instances as unschedulable and evicts their pods; replacement pods get created on the new instances.
  5. The old instances are terminated.

This upgrade approach doesn’t cover all classes of instance upgrades. For example, some changes to launch configurations cause cloud providers to terminate instances unilaterally, bypassing the graceful termination logic described above.

We addressed this by again leveraging the concept of immutability, this time at the instance group level. By making certain instance group configuration parameters immutable, we could implement instance group upgrades by replacement.

  1. A single instance group exists in a steady state.
  2. An instance group with a new configuration is created.
  3. The new instance group scales up.
  4. Workloads are drained from the old instance group, and replacements are scheduled in the new group.
  5. The old instance group is removed.

Once again, extending our application of ephemerality and immutability of compute resources has proved valuable. Similar to Deployment-managed ReplicaSets, instance groups now have no configuration history. All of an instance group’s members are derived from a single, immutable template, making them easier to debug.

Moreover, because upgrades occur on a gradual rolling basis between two distinct instance groups, we can evaluate each group separately. Tracking the health of each group helps us determine whether the new group’s configuration is correct and roll back if our telemetry detects issues. The example below demonstrates failed upgrade behavior, using launch failure rate as the instance group health metric; a sketch of the rollback check follows it.

  1. A single, healthy instance group exists in a steady state.
  2. A new instance group is added.
  3. The new instance group experiences some launch failures when scaling up due to a bug in its configuration.
  4. The infrastructure begins to move workloads to the new instance group, which continues to experience some launch failures.
  5. Once the new instance group’s launch failure rate surpasses a configurable threshold, the infrastructure marks it as unhealthy and moves workloads back to the old instance group.
  6. The new, failing instance group is scaled down.
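The rollback decision itself can stay simple. The sketch below is hypothetical (the types and thresholds are not Rubix's), but it captures the check described above: a group is marked unhealthy once its launch failure rate crosses a configurable threshold.

```go
// groupHealth summarizes an instance group's recent launch attempts.
type groupHealth struct {
	launchAttempts int
	launchFailures int
}

// shouldRollBack reports whether the new instance group's launch failure rate
// has exceeded the configured threshold (for example, 0.2 for 20%).
// minAttempts prevents a single unlucky launch from triggering a rollback.
func shouldRollBack(health groupHealth, threshold float64, minAttempts int) bool {
	if health.launchAttempts < minAttempts {
		return false
	}
	failureRate := float64(health.launchFailures) / float64(health.launchAttempts)
	return failureRate > threshold
}
```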

Extensions

Our broad application of the concept of ephemeral compute brought us to a place where instance upgrades and rollbacks happen automatically. At this point, we asked ourselves: can we leverage this system to improve our resiliency to cloud capacity issues?

Cloud capacity issues can arise from any number of causes, but they often occur along the dimensions of availability zones (AZs) or instance offerings (whether an actual instance type, such as one with a particular memory-to-CPU ratio, or a compute offering layered on top of instances, such as AWS EC2 spot). What would it look like to treat this class of capacity-related traits as immutable?

The parameters most relevant to cloud capacity issues often have enumerable values; production environments typically span three AZs, and opting for spot instances is a binary decision. Breaking up our existing instance groups along these new dimensions thus yields a bounded number of instance groups to which we can route capacity based on instance group health.
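To illustrate how bounded this set is, the resulting groups are simply the cross product of the capacity-related dimensions; with three AZs and two capacity types, that is six groups. The field names here are hypothetical.

```go
// instanceGroupKey identifies one instance group after splitting along
// capacity-related dimensions.
type instanceGroupKey struct {
	availabilityZone string // e.g. "us-east-1a"
	capacityType     string // e.g. "on-demand" or "spot"
}

// enumerateGroups yields one instance group per (zone, capacity type) pair,
// giving the control plane a fixed set of groups to route capacity between.
func enumerateGroups(zones, capacityTypes []string) []instanceGroupKey {
	var groups []instanceGroupKey
	for _, zone := range zones {
		for _, capacityType := range capacityTypes {
			groups = append(groups, instanceGroupKey{
				availabilityZone: zone,
				capacityType:     capacityType,
			})
		}
	}
	return groups
}
```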

While splitting out instance groups introduced new responsibilities (such as balancing scaling across AZs), it hardened our infrastructure by enabling our control plane to automatically route capacity to healthy instance groups when others experience cloud provider outages.

Conclusion

Palantir treated Rubix as an opportunity to extend ephemeral compute, a concept with proven production value, beyond the pod level to the K8s node level. After tackling the concomitant challenges, we realized this effort’s benefits in terms of improved security, easier upgrades, and better resiliency to cloud provider outages.

In a future post, we’ll cover how this foundation of ephemeral compute infrastructure has enabled us to tackle and rein in cloud costs. Stay tuned!

Interested in helping build the Rubix platform and other mission-critical software at Palantir? Head to our careers page to learn more.
