How Palantir Apollo Saves Developer Time on Kubernetes

Palantir Blog · Sep 23, 2022

Editor’s note: This blog post is the third in a series about Palantir Apollo, following publication of Why Traditional Approaches to Continuous Deployment Don’t Work Today and Palantir Apollo Orchestration: Constraint-Based Continuous Deployment For Modern Architectures.

Over the last decade, infrastructure platforms have grown to meet the increasing demand for using containers as the fundamental units of deploying software. The most popular of these platforms, Kubernetes (K8s), supports a vast array of primitives for managing containerized workloads. It is also overwhelmingly complex. Getting even a simple application running requires knowledge and coordination of several K8s resources. Forcing developers to reckon with such concepts can eat away at their productivity and can weaken an organization’s ability to standardize infrastructure usage.

At Palantir, we recognized it was critical to enable developers to deploy on K8s without grappling with all its complexities. We tackled this challenge by defining the Apollo Product Specification, a standard that guarantees an application’s deployability when followed. Two core Apollo components cement this specification’s value: the Apollo SDK helps developers build applications against the Product Spec’s declarative API, while the Apollo Service Management Plane converts these applications into functioning sets of Kubernetes-native resources.

This approach has resulted in a simpler, repeatable deployment experience for our developers. Just as importantly, it has enabled our deployment platform to evolve over time, handling K8s API deprecations/removals as well as expediting new feature adoption. In this post, we further detail the problems organizations face when deploying on Kubernetes, the roles of the Apollo SDK and Service Management Plane, and examples of the value this has unlocked for Palantir.

Problem Statement

Deploying an application to Kubernetes is… hard. While the platform offers solid primitives, great extension points, and thorough documentation, the task of deploying even a simple application poses a daunting series of questions:

How do I use a ConfigMap to provide my application with configuration? What if my application needs IAM credentials? What are pod controllers, and which one should I use? If my application needs to store data, how do I set up that storage? What are pod disruption budgets, node selectors, pod (anti)-affinities, and topology spread constraints, and should I be using any of them? How does Kubernetes pod networking even work?

The list goes on... Faced with all this complexity, developers often waste time climbing a steep learning curve or just copy a working example from another application’s repository. Over time, this can result in each piece of a service-oriented architecture having its own bespoke deployment story, with several accompanying ill effects:

  • K8s resource usage is application-specific — developers make mistakes and learn from them in isolation, and the lack of a deployment standard complicates debugging production issues.
  • K8s upgrades involving deprecated API removals require an organization-wide effort to review all applications’ deployment approaches.
  • New K8s feature adoption is a slow, per-team effort.

K8s’ lack of higher-order resource kinds can prevent developers from engaging in simple, consistent ways when deploying. At Palantir, we recognized the need to encapsulate this complexity and present developers with a simple, declarative way of defining and deploying their applications. To that end, we defined the Apollo Product Specification: a standard that applications can satisfy in return for not just a guarantee of deployability, but also a simple way to opt into K8s’ powerful features. We also built the Apollo SDK, a set of tooling to facilitate the development of applications compliant with this spec. Finally, we built the Apollo Service Management Plane to do the heavy lifting of taking a spec-compliant application and turning it into a working set of Kubernetes resources.

These Apollo components have protected developers from deployment complexity while also enabling them to leverage powerful Kubernetes features via simplified primitives.

The Apollo Product Spec and SDK

We had two fundamental goals when building the Apollo Product Spec and SDK:

  • Enable developers to define the fundamental characteristics of an application in simple terms (statefulness, openness to external network traffic, IAM credentials, etc.).
  • Provide deployability guarantees at development time, shielding developers from dealing with the quirks of infrastructure setups where their software is deployed.

To that end, the Product Spec encompasses configuration points for developers to state the basic requirements, or traits, of an application. These configuration points are concise and declarative — for example, requesting persistent storage just entails providing a name and a desired size (my-app-data: 10Gi). Critically, developers use the Product Spec to state only what they need; determining how to satisfy these stated requirements is left to the Apollo Service Management Plane.
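
Since the full Product Spec schema is beyond the scope of this post, here is a minimal, hypothetical sketch of what such trait declarations could look like; every field name below is invented for illustration, not the actual Apollo schema:

```yaml
# Hypothetical Product Spec trait declarations. All field names here are
# invented for this sketch; the real Apollo schema is not published in
# this post.
product:
  name: my-app
  traits:
    # Request persistent storage by name and size only; *how* it is
    # provisioned (storage class, volume binding) is left to the Apollo
    # Service Management Plane.
    persistent-storage:
      my-app-data: 10Gi
    # Declare that the application accepts external network traffic,
    # without choosing an ingress implementation.
    external-ingress: true
    # Declare that the application needs IAM credentials, without wiring
    # up any cloud-specific machinery.
    iam-credentials: true
```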

WitchcraftServices and the Apollo Service Management Plane

Applications compliant with the Apollo Product Spec map directly to WitchcraftServices, the Apollo Service Management Plane’s primary Kubernetes custom resource. The Service Management Plane is responsible for taking a WitchcraftService and producing the set of K8s resources that make up a working application, complete with all of the WitchcraftService’s declared traits. A set of control loops within the management plane programmatically decides which resource kinds to use when servicing a WitchcraftService’s stated requirements and coordinates all of these resources. This logic spans several K8s resources and covers how to bestow required attributes, such as persistent storage or ingress, upon applications.
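
To make that mapping concrete, here is a hypothetical, minimal WitchcraftService; the API group, version, and field names are invented for illustration, since the actual resource schema is not published here:

```yaml
# Hypothetical, minimal WitchcraftService. The API group, version, and
# field names are invented for illustration.
apiVersion: witchcraft.example.com/v1
kind: WitchcraftService
metadata:
  name: my-app
spec:
  image: registry.example.com/my-app:1.0.0
  # Declared traits carry through from the Product Spec; the management
  # plane's control loops decide which K8s resources satisfy them.
  persistentStorage:
    my-app-data: 10Gi
```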

[Diagram] The Service Management Plane takes WitchcraftServices and does the heavy lifting of coordinating several different K8s resources to achieve a working application.

The diagram above drives home exactly how much K8s complexity the Apollo Deployment Platform encapsulates. This doesn’t just involve choosing the right resources to create — it also handles the exhaustive alignment of inter-resource references, including label selectors (such as pod and node selectors), and explicit object name references (such as ConfigMap volume mounts).
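
For a sense of what that alignment entails in stock Kubernetes terms, consider the hand-written resources a developer would otherwise maintain (all names here are illustrative). A single mismatched label or object name in this set silently breaks the application:

```yaml
# Illustrative stock-K8s resources (names invented) showing the cross-
# references the Service Management Plane keeps aligned automatically.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-config            # referenced by exact name below
data:
  config.yml: |
    logLevel: INFO
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app                # must match the pod template labels...
  template:
    metadata:
      labels:
        app: my-app              # ...exactly, or no pods are managed
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0
          volumeMounts:
            - name: config
              mountPath: /opt/config
      volumes:
        - name: config
          configMap:
            name: my-app-config  # explicit object-name reference
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app                  # label selector must match the pods
  ports:
    - port: 8443
      targetPort: 8443
```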

Critically, the Service Management Plane codifies the decision-making for K8s resource usage and coordination in one place. Applying this standard across applications makes deployments repeatable without requiring effort by application developers who aren’t K8s experts.

This setup can protect developers from the gory details of deployment, freeing them to spend more time actually building. Applications that need to interact with K8s directly may still do so without forgoing the benefits of a standardized deployment story.

Apollo Product Spec Impact: Examples

Next, we look at two examples that demonstrate the concrete benefits of the Service Management Plane acting as a single place to encode Kubernetes usage, sparing each developer from:

  • Navigating the complex pod scheduling configurations available in K8s; and
  • Handling deprecated API removals.

Abstracting Away Pod Scheduling

Correctly configuring an application’s pod scheduling in Kubernetes is critical — pending or preempted pods can mean reduced availability or full-on outages. It is also hard. K8s exposes tons of pod scheduling and preemption knobs, including (but not limited to!) the following; a combined pod spec sketch appears after the list:

  • Node selectors, affinities, and taints and tolerations;
  • Pod affinities and anti-affinities; and
  • Pod priority, topology spread constraints, and resource requests and limits.
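
To show how these knobs pile up, here is an illustrative Pod that sets several of them at once; every label, taint, and priority class name below is an assumption about the cluster that an author would have to get right by hand:

```yaml
# Illustrative Pod (labels, pool names, and priority class invented) that
# exercises several scheduling knobs an author would otherwise set manually.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-0
  labels:
    app: my-app
spec:
  priorityClassName: high-priority        # requires a matching PriorityClass
  nodeSelector:
    node-pool: general                    # depends on cluster node labels
  tolerations:
    - key: dedicated                      # must match the nodes' taints
      operator: Equal
      value: general
      effect: NoSchedule
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: my-app
          topologyKey: kubernetes.io/hostname    # at most one replica per node
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone   # spread replicas across zones
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0.0
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          memory: 2Gi
```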

Each of these concepts requires Kubernetes knowledge to use properly. Additionally, some of these knobs depend on cluster-specific knowledge to correctly set:

  • What labels/topology keys and taints do the cluster’s nodes have?
  • How frequently must a cluster patch and reboot/replace nodes? How common are node failures?
  • What set of a cluster’s nodes can fit my resource-intensive application?

Developers can end up grappling with details of a cluster’s node provisioning just to get a pod running. There can also be multiple ways to achieve desired scheduling behavior — taking different approaches across applications only makes debugging scheduling issues harder. Sustainable deployment in a cluster, let alone across different clusters, just isn’t as realistic when developers are burdened with defining these scheduling configurations.

The Apollo Product Spec’s level of abstraction neatly enables our Service Management Plane to own the management of scheduling configurations on behalf of developers. The Service Management Plane infers the scheduling intent of WitchcraftServices and creates appropriate scheduling configurations, including:

  • Node affinities such that application pods run on a dedicated set of nodes within a cluster;
  • Pod anti-affinities which mitigate the effects of involuntary node disruptions;
  • Pod topology spread constraints to alleviate failure of an availability zone; and
  • Pod disruption budgets to prevent evictions from impacting a service’s required quorum for availability (a sketch of one such budget follows this list).
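
As one example, for a hypothetical three-replica service that needs two live members for quorum, the management plane could emit a budget along these lines (names illustrative):

```yaml
# Illustrative PodDisruptionBudget (names invented): voluntary evictions
# may never take a 3-replica, quorum-of-2 service below 2 ready pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```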

You can see a full description of this pod scheduling support in our newly published Apollo Service Management Plane pod controller documentation.

Once reached, the solutions here are not all that complicated — the real cost is traversing the learning curve to reach these solutions. Encoding this logic once in the Service Management Plane lets individual developers skip over this time sink.

Encapsulating these pod scheduling configurations has also greatly improved our ability to deploy software to new environments, such as customer-provisioned K8s clusters. Instead of a painful audit of all applications making up the Foundry platform, for instance, we have been able to make tightly scoped changes within the Service Management Plane itself to support different scheduling configurations as necessary.

The Service Management Plane’s ability to accommodate new underlying infrastructure setups extends to shielding developers from changes in underlying infrastructure APIs as well. Below, we’ll look at how this simplifies dealing with deprecated K8s API removals.

Handling API Deprecations On Behalf of Developers

Dealing with K8s API removals is a pain. Without thorough auditing, upgrading to a K8s version which removes a deprecated API can cause outages, and chasing down offenders takes time and effort. Handling cluster upgrades should be isolated to as small a set of engineers as possible.

The API boundary provided by the Apollo Product Spec simplifies a large swath of these API removals, since the Service Management Plane handles all the K8s API usage necessary for deploying services. When deploying a WitchcraftService, for instance, the Service Management Plane creates a PodDisruptionBudget on the application’s behalf. The K8s 1.25 release will stop serving the v1beta1 version of PodDisruptionBudgets in favor of the stable v1 API. Handling this API removal amounted to just one PR in one of the Service Management Plane’s controllers. Without the Apollo Product Spec, PDBs could be defined in each of the hundreds of applications composing the Palantir Foundry and Palantir Gotham platforms, necessitating an exhaustive audit in order to confidently upgrade our K8s environments.
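
In stock Kubernetes terms, that migration for a simple budget reduces to the apiVersion the controller writes; the spec body carries over unchanged:

```yaml
# Before: served until the K8s 1.25 release removes it.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
---
# After: the stable API, available since K8s 1.21; for a simple budget
# like this, nothing else changes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```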

Recap

Kubernetes is a complex platform with a steep learning curve that often overwhelms newcomers. Making developers figure out how to repeatably deploy applications directly to K8s is a massive time sink. We created the Apollo Product Specification to enable developers to avoid this trap while leveraging the power of K8s. Our Apollo SDK facilitates building Spec-compliant apps, or WitchcraftServices, and our Service Management Plane programmatically handles the heavy lifting of turning these WitchcraftServices into the right K8s resources. This setup has not only protected our developers’ time — it has also given our Service Management Plane space to transparently add support for different underlying infrastructures, taking our software to increasingly challenging environments.
