Ephemeral by Design

The Estate Nobody Planned For

Most enterprise teams don’t set out to build an unmanageable Kubernetes estate. They start with one or two clusters, move fast, ship, iterate, ship some more; you get the picture. Then a new project comes along that needs its own environment. A compliance boundary requires isolation. Another team gets their own non-prod cluster. Before long, there are eight clusters running across four different Kubernetes versions, two of which are out of support, and nobody is quite sure who owns the upgrade process for any of them.

This isn’t a tooling problem. Azure has the tooling. The problem is that nobody ever made a deliberate decision about how the estate should be operated.


When Non-Prod Stops Validating Anything

If I were a gambler, I’d say you’ve seen this pattern even if you haven’t named it. A cluster gets created for a specific purpose, whether that’s staging, UAT, or a dev team’s sandbox. It gets stood up once, handed over, and never revisited with the same rigor as production. Cluster upgrades happen reactively, when something breaks or when a support deadline forces the issue. Configuration changes accumulate directly against the cluster, via “Portal Adventures”, rather than through Infrastructure as Code, because the code was never kept current. After a year, the staging environment no longer reflects production. After two years, it’s running a Kubernetes version that production upgraded away from six months ago.

The result is a validation environment that no longer validates anything meaningful. And because nobody defined what “ownership” means for non-prod infrastructure, the upgrade responsibility has quietly fallen through the cracks.

This is how a fleet of five clusters becomes operationally equivalent to a rack of servers that’s been running so long that no one remembers who owns it. The tooling is different. The outcome is the same.

This is the cattle-vs-pets problem applied to your Kubernetes estate. The terminology is a decade old. The problem persists.


Production Clusters Upgrade. Non-Prod Clusters Disappear

The fix to this issue isn’t a better tagging strategy or more monitoring. It’s making deliberate choices about how each class of cluster is treated, and then holding to those choices.

Production clusters are first-class infrastructure. They have defined owners, a documented upgrade strategy, and an upgrade schedule that doesn’t depend on someone having a quiet week. The upgrade process is tested, not improvised. Configuration is managed through Infrastructure as Code that is treated as a primary deliverable, not a rushed afterthought. Version mismatch across production clusters is a risk that is actively managed, not discovered in a support ticket.

Non-production clusters are ephemeral by design. They are not persistent environments that accumulate history. They exist for the duration of a release cycle, scoped to a specific validation window, and are torn down when that window closes. The next release gets a fresh cluster, rebuilt from scratch using the same code that provisions production. This is the point: because the Infrastructure as Code is shared, non-prod is always consistent with production. Always. Configuration drift becomes structurally impossible.

Ephemeral by Design is a philosophy, but it is also an engineering problem. Rebuilding a cluster from scratch on a release cadence requires a provisioning pipeline that is fast and reliable enough to be practical. It requires Infrastructure as Code that is actually complete, not complete only when your senior engineer is riding the deployment rapids and WILLING it through. The code also needs a clear answer for what happens to data and state at teardown. None of these problems are unsolvable, but none of them are free either. The mechanics of making ephemeral non-prod clusters work in practice, including what the pipeline looks like, how the Infrastructure as Code needs to be structured, and how bootstrapping and teardown are handled, are each worth at least one article of their own. We will come back to them in detail.
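To make the shared-code idea concrete, here is a minimal sketch of what a shared AKS module might look like in Bicep. Every name, version, and SKU here is an illustrative assumption, not a production-ready definition; the point it demonstrates is that production and ephemeral non-prod deployments consume the same file and differ only in parameter values.

```bicep
// aks-cluster.bicep -- hypothetical shared module (illustrative only).
// Production and ephemeral non-prod both deploy this same file; only
// the parameter values differ, so drift has nowhere to live.
@description('Cluster name, e.g. aks-prod-weu or aks-payments-2025-10-r2')
param clusterName string

@description('Pinned Kubernetes version, shared by prod and non-prod')
param kubernetesVersion string

param location string = resourceGroup().location

resource aks 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: clusterName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: clusterName
    kubernetesVersion: kubernetesVersion
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        count: 1
        vmSize: 'Standard_D4s_v5' // illustrative SKU
        osType: 'Linux'
      }
    ]
  }
}
```

Under this shape, an ephemeral cluster is just another deployment of the same module into its own resource group, and teardown is deleting that resource group.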

This model extends naturally to team isolation. When multiple release streams run in parallel, each team gets its own ephemeral cluster for its release cycle, provisioned from the same IaC baseline. There’s no contention over shared environments, no “who broke staging” conversations, and no configuration bleed between release candidates. Parallel release teams get parallel environments: each a clean slate, scoped to a specific release, and destroyed automatically when the release ships. The isolation that would normally require careful environment management and coordination comes for free as a consequence of the model.
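One way the per-team, per-release scoping could be expressed is a thin wrapper over a shared module, keyed by team and release identifiers. This is a sketch under assumptions: the module path `aks-cluster.bicep` and all parameter names are hypothetical, not an established convention.

```bicep
// ephemeral-release.bicep -- hypothetical wrapper (illustrative only).
// One ephemeral cluster per team per release, provisioned from the
// same module that provisions production.
param team string             // e.g. 'payments'
param releaseId string        // e.g. '2025-10-r2'
param kubernetesVersion string

module cluster 'aks-cluster.bicep' = {
  name: 'deploy-aks-${team}-${releaseId}'
  params: {
    // Team and release baked into the name: parallel teams get
    // parallel clusters with zero shared state between them.
    clusterName: 'aks-${team}-${releaseId}'
    kubernetesVersion: kubernetesVersion
  }
}
```

Deploying each wrapper into its own resource group is what makes teardown trivial: when the release ships, the pipeline deletes that resource group and the environment is gone.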

This distinction sounds obvious when stated plainly. In practice, I’ve seen very few teams enforce it.


Pushbacks Happen

There are three pushbacks that come up consistently when ephemeral non-prod clusters are raised as a recommendation. Let’s look at all three.

“We have stateful test data living in the cluster.” This is a real constraint, and it is also an antipattern worth naming. Data that requires a long-lived cluster to exist, such as seed data, test datasets, or accumulated state, is a pipeline problem. The test environment has become a database, which means it can no longer serve its primary purpose as a validation environment. The path forward is pulling that data into a managed data store with a proper seeding process, not preserving the cluster to protect the workaround. If rebuilding means losing something important, that’s worth paying attention to, as it usually points to a gap that will surface somewhere else.

“Rebuilding every release is too much effort.” Yes. That is the point. If rebuilding a cluster from code is genuinely difficult, because it takes days, requires manual steps, or produces inconsistent results, then the Infrastructure as Code is not in the shape it needs to be for production operations either. The friction is revealing a gap, not creating one. A team that can’t confidently rebuild a non-prod cluster from scratch is a team that also cannot confidently recover a production cluster after a disaster. Ephemeral by Design doesn’t create the operational maturity problem. It surfaces it.

“Running a cluster per release team sounds expensive.” This is the objection that deserves the most careful answer, because the cost concern is valid. Consider what persistent non-prod infrastructure actually costs. A staging cluster running 24 hours a day, seven days a week, between releases is paying for availability that nobody is using. An ephemeral cluster built from production code inherits production scaling configuration: the same autoscaler settings, the same node pool rules. When it isn’t under load, it behaves exactly like production under no load: it scales down. You pay for what you use, for as long as you use it.
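The scale-to-idle behaviour falls out of autoscaler settings already living in the shared code. A fragment like the following (values are illustrative, not a recommendation) would apply to production and non-prod alike; an ephemeral cluster with no load simply sits at `minCount`.

```bicep
// Fragment of a shared agent pool definition (illustrative values).
// Because non-prod deploys the same code, it inherits these settings:
// an idle ephemeral cluster drops to minCount rather than paying for
// a full pool around the clock.
{
  name: 'user'
  mode: 'User'
  vmSize: 'Standard_D4s_v5'
  enableAutoScaling: true
  minCount: 1   // what an idle non-prod cluster actually costs
  maxCount: 10  // the same ceiling production validates against
}
```

Note that this is the same mechanism in both environments, which is the cost argument in miniature: non-prod does not need a separate, cheaper configuration that would reintroduce drift.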

The deeper cost argument is harder to put a number on but more significant in the long run. Drift between non-prod and production is not a free problem. When a release passes validation in an environment that no longer reflects production, the failure that surfaces after deployment carries a cost, whether it’s remediation time, incident response, or potential customer impact. That cost doesn’t show up on your Azure bill, but it is real and it compounds. So does the operational overhead of maintaining a growing estate of long-lived clusters, each requiring individual attention, each accumulating its own history of manual changes and deferred upgrades.

The engineering investment to make ephemeral non-prod work, such as maturing the Infrastructure as Code, automating the provisioning pipeline, and solving the test data problem, will pay back across all three dimensions. Lower persistent infrastructure spend. Fewer drift-related incidents. Reduced operational overhead as the estate grows. The upfront cost is real. So is the return.


When Five Clusters Become Fifteen

Getting deliberate about cluster lifecycle for a handful of environments is achievable with process and discipline. At five clusters, a shared document and clear ownership can carry you a long way. At fifteen clusters spread across multiple teams, regions, and workload boundaries, the operational weight shifts. Manual coordination scales poorly, if at all. Version consistency across the estate doesn’t happen by accident. Upgrade orchestration across a fleet is a different class of problem than upgrading a single cluster.

That’s where this series goes next. The lifecycle principles in this article hold at any scale. But enforcing them across a growing fleet requires more than good intentions. It requires tooling and patterns designed for multi-cluster operation.


Next in this series: Azure Kubernetes Fleet Manager, what it actually delivers for AKS estate management, where it fits, and where the gaps remain.

If you’re running non-prod clusters that haven’t been rebuilt in six months, rebuild one from your Infrastructure as Code this week and document what breaks. The friction is the signal.

Questions or feedback? Feel free to reach out!