The Benefits of Remote Ephemeral Workspaces

Published in Palantir Blog · 9 min read · Jun 6, 2022

As one of the DevOps teams at Palantir, we are always looking for new ways to improve the productivity of our developers. We recently decided to explore how remote workspaces (including AWS EC2 instances, AWS Cloud9, or dedicated machines) could be beneficial. Many articles have already been written about the benefits of remote workspaces. At Palantir, we took this paradigm one step further and used an ephemeral remote workspace model to improve local build times, make developers more productive, and reduce onboarding costs. And we did all this without breaking the bank!

Complexities of an Enterprise Environment

While the referenced articles emphasize the many benefits of remote development, a large enterprise environment adds complexities of its own, particularly around performance efficiency, standardization, and operational efficiency.

Complexities to be considered include:

Performance Efficiency

  • Building large complex software products requires heavy compute resources (e.g., GPUs for ML training, increased RAM for tests and build), forcing developers to consider alternatives beyond laptops such as desktop towers, racked servers, or spawning a VM.

Standardization

  • Delivering development infrastructure to large organizations at scale without sacrificing development speed requires standardized offerings to help reduce operation and maintenance (O&M) burden.
  • As development environment recommendations change, configuration drift increases the support burden, and developers keep running into the “works on my machine” syndrome.
  • Environments should closely represent how code runs in production, from system libraries (e.g., a specific Python version) to deployment models (e.g., Docker containers).

Optimization for Operational Efficiency

  • Hiring and onboarding new people in a complex environment can be frustrating and requires time to initially configure.
  • A given employee may move in and out of multiple development environments. They may need to work in different types of repos (front-end vs. back-end) or have competing requirements (e.g., different Java versions). This can contribute to significant overhead costs as they work through different requirements.
  • You have to control your development costs. If your costs are growing faster than the benefits you’re providing, you’re probably not going to be a company for very long.

Solving for Complexity

At Palantir, we are big believers in ephemeral infrastructure and use this pattern in many of our production environments to address different challenges. (Check out this blog post on our Rubix platform or this one on Active Directory for how and why). For maximum impact and lower operational burden, we knew that if we could combine the benefits of ephemeral infrastructure with the benefits of remote workspaces, we could unlock a new level of productivity for the company.

Some Options

Before jumping all the way into our solution, let’s go over some potential options.

  • Get people more capable laptops. This addresses the lack of compute but does not help organizations scale with cost. With so many working from home and recent supply chain issues, getting more capable laptops can also be logistically complicated.
  • Have people provision their own VMs in the cloud. Palantir already has an internal service for people to self-service virtual machines (VMs) so this approach would allow us to leverage an existing capability. While some developers had already started using the remote development model on their own, this option requires a lot of configuration overhead per developer and can still lead to configuration issues and “works on my (other) machine” syndrome.
  • Get people pre-configured VMs. While this option means developers would have to manage less, there would be a heavier burden on the infrastructure team (us) to make sure pre-configured VMs were patched, updated, and available when the developer needed them. This kind of overhead is something we were hoping to avoid and fortunately there are a few services in this space such as Amazon Workspaces and Google Virtual Desktops that provide a good starting point.
  • Offload the management of some infrastructure. Offload portions of the infrastructure to a SaaS provider like GitHub Codespaces or something else to run on a container farm on Kubernetes. This lowers developer overhead while reducing the need for the infrastructure team (us again) to manage individuals. We wanted to avoid building our own product since we would have to worry about automating features like turning on new workspaces, auto-updating containers, and SSH key management.

After exploring all of the above options, we eventually settled on using Coder on top of Amazon’s Elastic Kubernetes Service (EKS) as it met the broader needs of our developers (at the time of writing this blog).

What Excited Us About Coder

  • Operational Wins: It works on top of Kubernetes for the compute layer. When combined with Amazon EKS we could minimize the overhead for our team.
  • Standardization with Flexibility: Coder allows the user to have custom dotfiles and startup scripts. Developers have spent years optimizing their flow and this gives them the customization layer that they need to keep the environment feeling familiar.
  • Developer Empowerment: Self service access for provisioning more resources. With a few clicks, developers can deploy new environments, request more resources, and get to work faster.
  • Ecosystem Integration: It integrates natively with GitHub Enterprise. Users can easily clone repos and manage their own SSH setup. Coupled with self-service access features, such integrations eliminated a lot of the potential transactional support from our team.

But Does It Really Work?

After deploying the infrastructure, we knew our next most important goal was proving to developers that remote workspaces would greatly improve their lives. To do this, we partnered closely with our product development teams to ensure:

  • Latency and user experience were seamless
  • Development containers accurately represented the environment
  • User guides were created for each code base

Below are some of the benefits we got once we had deployed:

  • Faster development cycles (including it here again because of how big of a deal this is). Developers can get more CPU/RAM with the push of a button. This sped up the entire develop → build → test process across the board. For an example front-end build, build times improved by 78%!
  • Faster download times. Since the workspaces live right next to our code repository and artifact servers, download times and request latency are very low. For an example repository, this improved git clone times by 71%. While we expected faster development cycles in general, this was an unintended benefit.
  • Reduced onboarding and support. The dedicated development infrastructure teams began advocating for users to switch to Coder. They were able to onboard new users faster and had to debug fewer issues on people’s local machines. This meant they could focus more time on optimizing the developer tooling.
  • Increased external contributions. With an easier path to development environments, teams can easily contribute to each other’s codebases. This was especially useful for our engineers working directly with customers. The overhead for fixing small bugs was just requesting a workspace instead of downloading all of the product dependencies.
  • Better local laptop experience. With many demanding parts of the developer workflows offloaded, people are seeing better performance on their local machine. No longer are we hearing “I can’t share my screen because my server is running!”
  • Remote Docker environments. Some developers used the increased resources to create a hybrid environment. They did some of their development locally and ran docker and test servers remotely. We expected people to move all of their development over to their new workspace, but this showed that a hybrid approach was possible as well.

Faster Development Cycles Deep Dive

In this section, we offer some concrete examples of the improvements we saw.

For context, in this test we used a Foundry repository to build the front-end. The repository is a monorepo of ~4.5 GB that makes heavy use of git-lfs.

These speed benefits speak for themselves. While smaller repositories saw improvements as well, this one highlights just how much faster things got. Research shows that it can take up to 20 minutes to get back into a state of flow, which is critical to staying productive. With hundreds of developers, that adds up to hundreds of hours of lost productivity every day. Whatever we can do to keep developers in the flow is a win for the whole organization.

Cost Considerations

Another important aspect to consider is cost. After all, giving developers infinite compute power would probably cost a lot of money. While we didn’t expect this project to reduce costs overall, we needed to make sure that we weren’t breaking the bank during our rollout.

Let’s take a look at a quick cost breakdown.

Macbooks. At Palantir, we replace a developer’s laptop roughly every two years. Most developers use a Macbook. For the standard developer today, we provide a 32GB M1 Macbook which costs roughly $2900. Spread that across two years, and that comes out to roughly $1500 per developer per year.

EC2 Instances. For the new world with remote ephemeral workspaces, the cost considerations are slightly different. Rather than buying a beefy Macbook, we would instead buy a standard M1 Macbook, which comes in at roughly $2000. This would bring the laptop cost per developer down to $1000 per year, but we still need to account for the new EC2 costs.

Most of our developers need more RAM, so we’ve decided to use r5.4xlarge instance types in Amazon. As of this writing, an r5.4xlarge costs $1.008 per hour. This comes out to $8830 per year. From here, we can make a few optimizations that most people won’t notice.

  • Optimization 1 — On this instance type, we can very comfortably fit three developers. This takes advantage of the containerization architecture.
  • Optimization 2 — We don’t need to have the instances on during the weekends. This means on average, we’re only paying for 5 out of 7 days. As long as we have automation that allows users to easily spin up during weekend hours, we’ll save good money here.
  • Optimization 3 — We don’t need to have the instances on 24 hours a day. Most developers work in a 12-hour window relative to their region, so we can turn these machines off during off hours. We can use similar automation to above to make sure developers are never blocked.
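
Optimizations 2 and 3 depend on automation that keeps workspace nodes off outside working hours while still letting a developer override it. A minimal sketch of that scheduling decision follows. It is not the automation Palantir built: the 08:00 to 20:00 window, the function name, and the override flag are all hypothetical, and a fixed UTC offset stands in for a real regional time zone (so daylight saving time is ignored).

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Hypothetical 12-hour working window; the post only says
# "a 12-hour window relative to their region".
WINDOW_START = time(8, 0)
WINDOW_END = time(20, 0)


def should_be_running(now_utc: datetime, region_tz: tzinfo, override: bool = False) -> bool:
    """Return True if a workspace node for this region should be powered on."""
    if override:
        # A developer explicitly asked for a workspace (e.g. on a weekend).
        return True
    local = now_utc.astimezone(region_tz)
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    in_window = WINDOW_START <= local.time() < WINDOW_END
    return is_weekday and in_window


# Example: a region four hours behind UTC (fixed offset, an assumption).
region = timezone(timedelta(hours=-4))
monday_morning = datetime(2022, 6, 6, 14, 0, tzinfo=timezone.utc)   # 10:00 local, Monday
saturday = datetime(2022, 6, 11, 14, 0, tzinfo=timezone.utc)        # 10:00 local, Saturday

print(should_be_running(monday_morning, region))
print(should_be_running(saturday, region))
print(should_be_running(saturday, region, override=True))
```

A scheduler could evaluate a check like this periodically and scale the node group up or down accordingly; the override path is what keeps developers from being blocked outside the window.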

Combining the laptop and EC2 figures, the differential is roughly a 41% increase in overall per-developer costs. However, there are other cost optimizations that could be considered: some developers will use smaller remote workspaces; some will not work a full 12 hours; and some developers don’t do development every day. For Palantir, this cost increase is more than offset by the productivity of our developers and better Operations & Maintenance (O&M) for us! Ultimately, it is up to your organization to decide if the cost trade-off is worthwhile, but it certainly was for us.
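
The arithmetic behind the roughly 41% figure can be reproduced from the numbers quoted in this section. The sketch below is a back-of-the-envelope reconstruction, not Palantir's actual cost model. It assumes the three optimizations simply multiply, uses the unrounded $1,450 per year for the 32GB laptop (the text rounds this to $1,500), and ignores every cost other than the laptop and the EC2 instance. Function and variable names are illustrative.

```python
# Figures quoted in the text above.
HOURLY_RATE_USD = 1.008        # r5.4xlarge price per hour
BEEFY_LAPTOP_USD = 2900        # 32GB M1 Macbook
STANDARD_LAPTOP_USD = 2000     # standard M1 Macbook
LAPTOP_LIFETIME_YEARS = 2      # laptops are replaced roughly every two years


def ec2_cost_per_developer(hourly_rate: float,
                           developers_per_instance: int = 3,
                           days_per_week: int = 5,
                           hours_per_day: int = 12) -> float:
    """Yearly EC2 cost per developer after applying the three optimizations."""
    always_on = hourly_rate * 24 * 365           # instance running all year
    shared = always_on / developers_per_instance  # Optimization 1: three developers per instance
    weekdays_only = shared * days_per_week / 7    # Optimization 2: off on weekends
    return weekdays_only * hours_per_day / 24     # Optimization 3: 12-hour window


baseline = BEEFY_LAPTOP_USD / LAPTOP_LIFETIME_YEARS
ec2 = ec2_cost_per_developer(HOURLY_RATE_USD)
remote = STANDARD_LAPTOP_USD / LAPTOP_LIFETIME_YEARS + ec2
increase = (remote - baseline) / baseline

print(f"Laptop only:       ${baseline:.0f} per developer per year")   # $1450
print(f"EC2 share:         ${ec2:.0f} per developer per year")        # $1051
print(f"Laptop + EC2:      ${remote:.0f} per developer per year")     # $2051
print(f"Relative increase: {increase:.0%}")                           # 41%
```

With the rounded $1,500 baseline the same calculation gives about 37%, so the quoted 41% appears to come from the unrounded laptop figure.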

Where We’re Going

We’re pretty excited about what we’ve been able to accomplish so far and we are seeing usage numbers increase as more developers find out about our new offering. Currently, we are seeing the most traction with our front-end repositories. These repositories are large and most developers are already using VSCode, which offers great support for remote development and has gotten good feedback from our developers. For our Java projects, we are testing other IDEs to ensure they provide a high-quality experience for our developers. We continue to see improvements and hope to roll out ephemeral remote workspaces more broadly soon.

Conclusion

Our experience reaffirmed remote developer environments as a great way to increase developer productivity. More importantly, applying ephemeral infrastructure paradigms is a sustainable and cost-effective way to manage these high-value investments.

If you want to help us build this future and drive other important outcomes, please come join us!

Author

Ashir


Responses (2)

Thanks for the post! Can you share why you choose coder.com over GitHub Codespaces?


Thanks for sharing this interesting journey! Are you using something like Dazzle (by Gitpod) to create the workspace Docker images for each different environment?
