Introducing Rubix: Kubernetes at Palantir

Published in Palantir Blog · May 16, 2019

In January 2017, Palantir kicked off a project called Rubix, our effort to rebuild our cloud architecture around Kubernetes. Given that the majority of our cloud instances are dedicated to computation, one of the core objectives of this effort was to create a secure, scalable, and intelligent scheduling and execution engine for Spark and other distributed compute frameworks. We’ve now successfully rolled out Rubix to the majority of our fleet, and we’d like to share a deep dive into a few of the challenges we faced, starting with a series of blog posts on what it takes to operate compute clusters on top of Kubernetes. Let’s go!

Background

Palantir Foundry is a data management platform that — among other things — enables users to integrate data by authoring and executing data transformation and pipelining code, and to perform ad-hoc and exploratory data analytics with programmatic and graphical query interfaces. Data analysis and transformation logic is executed on distributed compute frameworks such as Apache Spark. Foundry is a multi-tenant platform that allows users with different permissions to collaborate on shared data and code assets; a number of our customers use Foundry not only in-house, but also to provide their vendors and partners access to a common, collaborative environment in which all users can access the data powering their businesses.

Over time, Foundry evolved in ways that affected our deployment infrastructure priorities. One of the primary factors was a shift from exploratory data analysis to operational decision making: as customers began using pipelines to drive business decisions in real time, predictable performance became increasingly important. Specifically, we needed to guarantee execution time with lower variance than our initial cloud architecture — executing Spark jobs on Apache YARN with fixed resources — could support.

Rubix

As Kubernetes emerged as the standard deployment platform for modern PaaS systems, we made the decision in 2017 to migrate our in-house deployment infrastructure to Kubernetes. Because our existing infrastructure and Kubernetes share a similar design, the decision to go all-in on Kubernetes seemed straightforward, at least for applications and services.

However, we wanted to target a unified deployment substrate for applications and for compute clusters. The pivotal open question, then, was whether Kubernetes-backed compute clusters could satisfy the two critical requirements sketched above: (1) multi-tenant security in the presence of user-authored code, and (2) predictable performance. In particular, we evaluated Kubernetes against the obvious alternatives at the time, Apache YARN and Apache Mesos. Let’s look at these two challenges in more detail.

Security

Our first-generation cloud architecture had two means of executing user-provided code: we ran Spark applications on Apache YARN, and we executed other types of user-authored code on an in-house container solution. As containerization technology matured, we wanted to take advantage of its security benefits for all user code in Foundry, not just the code we were already running via our in-house solution. While YARN’s support for containers was still immature at the time, Kubernetes was compelling from a security perspective for the following reasons:

  • It provides a robust set of features for running a variety of workloads in containers. Mechanisms like pod security contexts (see the sketch after this list) are similar to the features we had previously implemented in our own container work.
  • The security concepts in Kubernetes govern both the built-in resources (e.g., pods) and the resources managed by extensions like the one we implemented for Spark-on-Kubernetes. This allowed us to design a consistent approach to security across all types of first-party and user-authored code.
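
To make the first point concrete, here is a minimal sketch, using the standard client-go types, of the kind of locked-down pod spec a security context makes possible for user-authored code. The image name, user and group IDs, and the specific fields chosen are illustrative assumptions, not a description of how Foundry actually configures its pods.

```go
// Sketch of a hardened pod spec for user code. Values are illustrative only.
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func boolPtr(b bool) *bool    { return &b }
func int64Ptr(i int64) *int64 { return &i }

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "user-code-example"},
		Spec: corev1.PodSpec{
			// Pod-level security context: everything in the pod runs as an
			// unprivileged, non-root user.
			SecurityContext: &corev1.PodSecurityContext{
				RunAsNonRoot: boolPtr(true),
				RunAsUser:    int64Ptr(1000),
				FSGroup:      int64Ptr(1000),
			},
			Containers: []corev1.Container{{
				Name:  "user-code",
				Image: "example.com/user-code:latest", // placeholder image
				// Container-level hardening: no privilege escalation, a
				// read-only root filesystem, and all Linux capabilities dropped.
				SecurityContext: &corev1.SecurityContext{
					AllowPrivilegeEscalation: boolPtr(false),
					ReadOnlyRootFilesystem:   boolPtr(true),
					Capabilities: &corev1.Capabilities{
						Drop: []corev1.Capability{"ALL"},
					},
				},
			}},
		},
	}

	out, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(out))
}
```

The same pod-level and container-level controls apply whether a pod is created directly or by an extension such as a Spark operator, which is what makes the consistent security model in the second point possible.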

We’ll be diving into more detail on how we secure Kubernetes in a later blog post.

Predictable performance and cost

Our customers were willing to pay for consistent performance, but not for the statically allocated hardware traditionally used to achieve it. For example, if a job runs for 1 minute today, users expect the same job to run in ~1 minute tomorrow. If they run it once an hour for 1 minute, they expect to pay for 24 minutes of compute per day, not for 24 hours of dedicated hardware. And, of course, getting those resources almost immediately is always preferable to getting them an hour later. To meet these demands, we needed to move away from static cluster sizes and variable job resources, towards dynamic cluster sizes and consistent job resources. Like YARN and Mesos, Kubernetes in principle allowed us to tune performance and cost independently through cluster autoscaling.
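
To make that arithmetic concrete, here is a tiny illustration of the gap between the two models, assuming a hypothetical on-demand price of $0.50 per node-hour; the numbers are made up, and only the shape of the comparison matters.

```go
package main

import "fmt"

func main() {
	const hourlyRate = 0.50 // hypothetical $/hour for one worker node

	// Statically allocated: the node is reserved around the clock,
	// even though the job only needs it for 1 minute each hour.
	staticCostPerDay := 24.0 * hourlyRate

	// Autoscaled: the cluster only grows for the ~24 node-minutes per day
	// that the job actually runs.
	autoscaledCostPerDay := (24.0 / 60.0) * hourlyRate

	fmt.Printf("static: $%.2f/day  autoscaled: ~$%.2f/day\n",
		staticCostPerDay, autoscaledCostPerDay)
}
```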

The devil, of course, turned out to be in the details: from tuning CIDR block sizes for the worker group subnets to minimizing instance boot time in order to handle large scale-ups in short periods of time. Most importantly, we extended the Kubernetes scheduler implementation in order to make compute performance predictable. The next blog post in this series will go into much more detail on our work on Spark-on-Kubernetes and on the challenges of scheduling Spark applications.
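
As a rough illustration of what extending the scheduler can look like, here is a minimal sketch of a scheduler-extender filter endpoint in Go. The request and response structs are simplified stand-ins for the extender wire format, and the `spark-app-id` label and in-memory reservation table are hypothetical; our actual Spark-aware scheduling work is considerably more involved, and the next post covers it properly.

```go
// Sketch of a scheduler-extender "filter" endpoint. The structs below are
// simplified stand-ins for the extender wire format; the reservation table
// and label name are hypothetical placeholders.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type extenderArgs struct {
	Pod struct {
		Metadata struct {
			Labels map[string]string `json:"labels"`
		} `json:"metadata"`
	} `json:"pod"`
	NodeNames []string `json:"nodenames"`
}

type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// Hypothetical reservation table: Spark application ID -> nodes on which
// capacity has already been reserved for that application's executors.
var reservations = map[string][]string{
	"spark-app-1234": {"node-a", "node-b"},
}

func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	appID := args.Pod.Metadata.Labels["spark-app-id"] // hypothetical label
	reserved := map[string]bool{}
	for _, node := range reservations[appID] {
		reserved[node] = true
	}

	// Keep only the candidate nodes that hold a reservation for this app.
	result := extenderFilterResult{FailedNodes: map[string]string{}}
	for _, node := range args.NodeNames {
		if reserved[node] {
			result.NodeNames = append(result.NodeNames, node)
		} else {
			result.FailedNodes[node] = "no capacity reserved for this Spark application"
		}
	}
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("failed to write filter response: %v", err)
	}
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```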

Where we are today

Given the challenges we faced, we could have chosen to continue investing in YARN and redesigned our cloud architecture to enable dynamic scaling and ephemeral YARN clusters. But more and more, we felt ourselves fighting an uphill battle, with every step forward requiring an increasingly bespoke solution. YARN also wasn’t well-suited for most of our user code workloads outside of Spark.

At the same time, Kubernetes was gaining traction in the wider community and was clearly the path forward, so that is the path we took. We implemented support for Spark and other user-code frameworks, evolved how we handle container networking, and worked on securing Kubernetes at scale. Now, two years into our journey with Rubix, Kubernetes is live across our production fleet, allowing Foundry users to run hundreds of thousands of Spark jobs across thousands of nodes each day.

We’d like to start sharing our insights into this fun, complex, and ever-evolving puzzle. Some of the approaches we took to particular problems might not be the ones you or your team would have taken, but we hope these posts can help inform your own forays in the Kubernetes space and provide a basis for continued discussion and problem solving within the community. We welcome comments, suggestions, and debate along the way.

You can read our blog post on Spark scheduling today, and we have a few ideas for other blog posts in the future, for example:

  • The evolution of container networking in Rubix — From Calico to Lyft’s vpc-ipvlan-cni (and beyond?)
  • Running Kubernetes on ephemeral and immutable infrastructure
  • Programmatically generating secure, scoped connections between production services using AWS PrivateLink

We will also discuss Rubix and running Kubernetes on ephemeral infrastructure at the upcoming KubeCon in Barcelona on May 23, 2019. We hope to see you there!

UPDATE: The KubeCon Barcelona video is now available: https://www.youtube.com/watch?v=7zUsc2zI5o8

Candidates who are interested in joining the Deployment Infrastructure team can apply here.

Authors

Amy C., Greg D., Robert F.
