robgibbon
on 26 October 2021
In defence of pet servers
We all know the drill by now: modern compute infrastructure needs to be deterministic, disposable, commoditised and repeatable. We’re all farmers now, and our server estates must be treated like cattle – ready for slaughter at a moment’s notice.
However, we must remember that the driver behind this new design rationale is primarily the unreliable nature of modern cloud compute infrastructure and its associated weak service-level agreements (SLAs). Let’s take a step back from the cattle-over-pets mantra for a moment and ask whether it is really always the right path to go down.
Temple computing
In past times, deployment engineers would carefully plan and rehearse making their prostrations before the software release altar, and once the release was ever so gently lowered into place, would back slowly away from the new running service, all the while making their benedictions.
The entire process was usually a manual one, possibly documented to some extent in a runbook, but often requiring secret knowledge and wisdom passed in whispers from master to apprentice. These days, that knowledge and wisdom are encoded into automation solutions like Juju, Terraform and other infrastructure-as-code (IaC) enablers, and releases are no longer manually planned and rehearsed, but instead run as fully automated deployment procedures.
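As a minimal illustration of what encoding a runbook step as automation might look like, the sketch below shells out to the Juju CLI to deploy a charm and then report the model status. It assumes the Juju client is installed with a controller and model already set up; the postgresql charm is only an example.

import subprocess

def deploy_and_check(charm: str) -> None:
    """Codify a simple runbook step: deploy a charm, then show model status."""
    # Deploy the application (assumes an active Juju controller and model)
    subprocess.run(["juju", "deploy", charm], check=True)
    # Surface the current model status, the same check the old runbook
    # would have asked an operator to perform by hand
    subprocess.run(["juju", "status"], check=True)

if __name__ == "__main__":
    deploy_and_check("postgresql")  # example charm, swap in your own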
Thus, in recent times, a great deal of investment has gone into improving the lives of deployment and operations teams. Site reliability engineering (SRE), sometimes known as the L2 support team, has similarly gained from improved tooling around observability – that is, metrics, monitoring, logging, and alerting, as well as post-mortem diagnostic tools, intrusion detection and prevention systems, network anomaly detection, and so on.
But despite all of this massive investment in transforming systems management from a world of beloved pets into a world of unloved herds of cattle, there still remains this one hard reality: for many business use cases, long-lived systems with very high uptime are far easier to deploy and operate, have a massively cheaper total cost of ownership and are simply more appropriate than the new, heroically disposable systems architectures.
Web tech goes mainstream
Whilst a web-based social media solution can tolerate days of extended downtime on some backend components when the cloud region they depend upon goes down (and through clever use of caching you might not even notice), a safety-critical application, for example a highly transactional air traffic control system or a high-voltage energy grid management application, cannot tolerate any downtime – even a few minutes of unavailability can have severe consequences.
For those applications, building a highly resilient, multi-region, multi-cloud infrastructure that can assure extremely high uptime even when the underlying virtual infrastructure is offered under a very weak SLA quickly becomes far, far more expensive than just building a decent infrastructure to begin with. It’s like building a house on sand versus building a house on rock – not quite a fool’s errand, but still.
So how can we reconcile this situation? Obviously there are lots of benefits to automating the operations teams’ runbooks. Security orchestration, automation and response (SOAR) is one very tangible example – by automating well-rehearsed procedures for responding to a security incident, the entire event can be dealt with extremely rapidly. In many cases, the time it takes to shut down a detected security incident has a direct effect on its severity to the business.
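To make that concrete, here is a minimal, purely illustrative sketch of an automated response playbook. The alert format and the helper functions (isolate_host, revoke_session_tokens, open_incident_ticket) are hypothetical stand-ins for whatever actions a real SOAR platform would expose.

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("soar-playbook")

# Hypothetical response actions: in a real SOAR platform these would call
# out to firewalls, identity providers and ticketing systems.
def isolate_host(hostname: str) -> None:
    log.info("Isolating host %s from the network", hostname)

def revoke_session_tokens(user: str) -> None:
    log.info("Revoking active sessions for user %s", user)

def open_incident_ticket(summary: str) -> None:
    log.info("Opening incident ticket: %s", summary)

def run_playbook(alert: dict) -> None:
    """Encode the rehearsed runbook for a suspected credential compromise."""
    isolate_host(alert["hostname"])
    revoke_session_tokens(alert["user"])
    open_incident_ticket(f"Credential compromise suspected on {alert['hostname']}")

if __name__ == "__main__":
    run_playbook({"hostname": "web-01", "user": "alice"})  # example alert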
For sure, systems fail all the time, especially at scale. But on the other hand, over-engineering a solution – in this case designing and building for disposable, shifting, unreliable infrastructure – might end up being more costly than just going out and buying enterprise-class equipment. If you don’t need to go to hyperscale, then architecting your solution for cloud infrastructure might not be the most cost-effective approach.
Certainly, pets need care, which means making a long-term commitment and investment in maintenance – ensuring that deployments remain reasonably up to date and are defended against common vulnerabilities and exposures (CVEs). But then, most of these solutions need significant software maintenance investment regardless of the approach to environment management. There is another nuance here – a long-running, stateful deployment, regardless of how the individual systems that compose it are treated, can often be considered a kind of “pet” needing significant care and attention.
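As a small illustration of that ongoing care, the sketch below lists the packages that apt currently reports as upgradable on an Ubuntu or Debian host; it is the kind of check an operator might schedule for a long-lived pet, and it assumes apt is available. What you do with the output is a matter of your own patching policy.

import subprocess

def pending_upgrades() -> list[str]:
    """Return the packages apt currently reports as upgradable."""
    result = subprocess.run(
        ["apt", "list", "--upgradable"],
        capture_output=True, text=True, check=True,
    )
    # The first line of output is a header; the rest are package entries
    return [line for line in result.stdout.splitlines()[1:] if line.strip()]

if __name__ == "__main__":
    for package in pending_upgrades():
        print(package)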
And there are certainly some use cases where the cattle paradigm really shines – for example launching a 10,000-node Apache Spark cluster for 10 minutes, crunching some really big data, and then terminating it and walking away. This approach ends up costing a few hundred dollars, versus the hundreds of thousands of dollars of investment it would take to own that kind of platform.
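For that kind of burst workload, the job itself can be as simple as the sketch below, assuming PySpark is available on the ephemeral cluster; the storage paths and column name are placeholders, and provisioning and terminating the 10,000 nodes is left to whatever tooling launches the cluster in the first place.

from pyspark.sql import SparkSession

# Runs on the short-lived cluster; the whole environment is thrown away afterwards
spark = SparkSession.builder.appName("ephemeral-crunch").getOrCreate()

# Placeholder input and output locations: substitute your own storage paths
events = spark.read.parquet("s3a://example-bucket/events/")
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://example-bucket/daily-counts/")

spark.stop()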
However, not every business use case will benefit from hyperscale cloud infrastructure, and whilst every organisation stands to benefit greatly from improved automation, getting the underlying infrastructure right can significantly lower the breakeven point at which that automation effort pays off. And at that point, the dispassionate cattle-farming mantra may start to fall away and we can go back to loving our pets.
Further reading: beat disruption – how to adapt your IT strategy for changing markets