agmatei
on 27 June 2024
Managed Apps on Public Cloud: Why Operations Matter, Part I
You might be tempted to think that running an app on a public cloud means you don’t need to maintain it. While that would be wonderful, it would require help from the public cloud providers and app developers themselves, and possibly a range of mythological creatures with magic powers. This is because any app, regardless of the infrastructure on which it runs or its output, requires maintenance in order to yield accurate and reliable outputs.
Consequently, even though you likely saved a fortune on upfront costs by choosing a public cloud infrastructure, you still need to spend on ensuring your application stack is operating efficiently and as intended. Neglecting maintenance could do significant harm to your business, projects, or scope. For that reason, outsourcing the operations of apps running on public cloud infrastructure is a good idea. Let’s take a closer look at why.
Choosing Public Cloud
Choosing between public cloud infrastructure and on-premises deployments is a conundrum that has puzzled DevOps and GitOps engineers for years. Both have their upsides and downsides, but ultimately the choice comes down to your use case.
Let’s take a look at uptime as an example. A private cloud would need to constantly run at near capacity to be cost-efficient – meaning, you would need to know what you’re going to be running on it and have an accurate estimate of how many resources those workloads will require. This would justify the purchase of hardware, given the resources it offers will be consistently maximised.
In comparison, high-performance workloads (such as those found in machine learning) benefit more from a highly flexible environment that can quickly increase the number of GPUs engaged in a computation to meet demand, and then decrease this number to a relatively low baseline for interpretation and optimisation. This can be achieved more easily with public clouds, with the added benefit of no upfront costs for hardware and installation.
For these reasons, public cloud becomes the favoured environment to deploy apps like AI/ML automation platforms (like Kubeflow), stream processing platforms (like Kafka), or any other application that operates using sporadic high influxes of data and computations that eventually settle to a relatively lower baseline.
Public Cloud App Operations
From an operational perspective, public cloud deployments require much the same ongoing maintenance as on-prem deployments. There is a persistent fallacy in the market that deploying an app on a public cloud cluster automatically makes operations easier. This may be true in some isolated instances, but it often leads to neglect that undermines the app’s functionality. What truly differentiates public cloud from private deployments is the aim of your operations: on-premises, you want stability and cohesion; on public cloud, you want flexibility and speed. Let us take a look at what operational excellence would look like in a public cloud deployment:
Performance monitoring
You want to be able to keep an eye on your applications and their performance. The trick is that this is a highly manual process, because each app comes with its own metrics. While most open source applications will have key indicators that are quantifiable (for example, in a Grafana dashboard), there are exceptions depending on both your needs and the app’s key functionality.
In general, performance metrics will include:
- Capacity reports (i.e. how many resources your app is using out of the cluster in which it’s deployed)
- Workload reports (i.e. how quickly your app does what it’s supposed to do)
- Error reports (i.e. what causes incidents to happen).
When setting up your deployment, it is key to ensure that you know what metrics you need to follow for each component, and make sure you’re implementing proper protocols that maintain constant and consistent performance measurement. This step can at times be manual, but has potential for automation in the hands of the right engineer.
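The three kinds of report listed above can be sketched as simple calculations over one measurement window. This is a minimal illustration, not a monitoring tool: the `AppSample` fields and the function names are hypothetical, and in a real deployment the numbers would come from whatever metrics source each app exposes.

```python
from dataclasses import dataclass


@dataclass
class AppSample:
    """One measurement window for a single app (all fields are hypothetical)."""
    cpu_used_cores: float     # cores consumed by the app
    cpu_cluster_cores: float  # cores available in the cluster
    requests_handled: int     # units of work completed in the window
    requests_failed: int      # units of work that ended in an error
    window_seconds: float     # length of the measurement window


def capacity_ratio(s: AppSample) -> float:
    """Capacity report: share of the cluster's CPU the app is using."""
    return s.cpu_used_cores / s.cpu_cluster_cores


def throughput(s: AppSample) -> float:
    """Workload report: completed requests per second."""
    return s.requests_handled / s.window_seconds


def error_rate(s: AppSample) -> float:
    """Error report: fraction of all requests that failed."""
    total = s.requests_handled + s.requests_failed
    return s.requests_failed / total if total else 0.0


def report(s: AppSample) -> dict:
    """Bundle the three indicators into one record."""
    return {
        "capacity_ratio": capacity_ratio(s),
        "throughput_per_s": throughput(s),
        "error_rate": error_rate(s),
    }
```

Calling `report(...)` on a schedule and storing the results is one way to keep measurement constant and consistent; which fields matter will differ for each component.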
Updates
It is no secret that you need to update all your apps – yes, even those on your phone – to ensure their optimal performance. The difference between the apps on your phone and the apps in your public cloud stack is that while not updating the former can result in decreased battery life and some delays, not updating the latter will result in your business losing money. And sometimes, lots of it.
Efficient updates can help you in multiple ways. First, they will give you access to the newest features of your apps; you already pay for them, so you might as well make the most of your money and maximise your tools’ functionalities. Second, the efficiency and performance of your ecosystem may increase with each update, because performance bugs are almost exclusively tackled with version updates (leaving critical bugs for patches, which we will discuss later). Third, and most importantly, updates reinforce the security of your entire environment, because outdated app versions are often vulnerable to breaches or attacks.
It is therefore essential to make sure that all your applications are consistently updated. However, a recent update launch does not mean you should immediately apply it. Each release needs to be scrutinised for compatibility, checked for vulnerabilities, and holistically integrated within your larger stack. Furthermore, updates can require a lot of time, might cause downtime to your workloads, and need to be implemented with extensive backup protocols to ensure data loss is minimised. To put it briefly, updates require a lot of time and attention, but remain essential for anything deployed in public cloud clusters.
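One way to make that scrutiny explicit is a pre-update checklist that must come back empty before a release is applied. The sketch below is an assumption about how such a gate could look; `ReleaseCheck` and its fields are hypothetical and simply mirror the checks named above (compatibility, vulnerabilities, backups).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReleaseCheck:
    """Outcome of reviewing one candidate release (hypothetical fields)."""
    compatible_with_stack: bool  # compatibility with the larger stack confirmed
    known_vulnerabilities: int   # open vulnerabilities found in the release
    backup_verified: bool        # a restorable backup exists for affected data


def blockers(check: ReleaseCheck) -> List[str]:
    """Return the reasons a release should not be applied yet."""
    reasons = []
    if not check.compatible_with_stack:
        reasons.append("compatibility not confirmed")
    if check.known_vulnerabilities > 0:
        reasons.append("open vulnerabilities in release")
    if not check.backup_verified:
        reasons.append("no verified backup")
    return reasons


def ready_to_apply(check: ReleaseCheck) -> bool:
    """A release is ready only when no blocker remains."""
    return not blockers(check)
```

Returning the list of reasons rather than a bare yes or no keeps a record of why an update was held back.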
Security
The word “security” has a tendency to be incredibly nebulous. Do you imagine a high wall that keeps any invader at bay? A super-secure gate which only opens upon multiple scans of one’s iris, fingerprints, and tone of voice? You’d be partly right in both cases, but there is so much more to security when it comes to public cloud instances.
If there were one concept to pick that connects to all aspects of cybersecurity in multi-cloud environments, it would be ‘access.’ In effect, all the security practices you’d need to implement ensure that access is restricted only to those who need it, and even then only for the duration of their need. Any malicious or dangerous operation begins with a party obtaining unauthorised access to a certain part of your system. From there, they can either break what they’ve accessed, steal it, or try to make their way to even more components, placing you at higher and higher risk. When it comes to public cloud app security, once you’re in, you’re in. So you need to make sure that whoever is in is supposed to be there. We will discuss security on the public cloud in more detail in a standalone blog post in the next few weeks, but for now let me briefly outline what it entails from an operational standpoint by breaking it down into four pillars:
- Configuration
Good security begins with the configuration of your cluster. Each component that makes up your public cloud stack must be correctly identified and documented, along with its variables and initial set-up metrics. You need a highly detailed, continuously updated map of your entire infrastructure, along with all the parameters associated with each component.
The work, however, does not stop there. Public cloud workloads are supposed to be constantly dynamic (if they were static, you’d be better off with a private cloud deployment). This means that, at any point during your deployment’s lifecycle, there may be changes to its infrastructure. You must ensure that changes are well scrutinised and considered before implementation, proactively announced to all the necessary stakeholders, and finally implemented by the correct agents, in a highly secure way that is mindful of the larger stack. These processes can be automated, but the automation in itself is something that will require extensive expertise, as missing out on even the smallest detail can compromise your work and leave you vulnerable to attacks.
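To illustrate the idea of keeping the documented map in step with a changing deployment, here is a minimal sketch that compares a documented inventory against what is actually observed. The dictionary layout and the `find_drift` function are assumptions made for this example; how the observed state is collected depends on your tooling.

```python
def find_drift(documented: dict, observed: dict) -> dict:
    """Compare documented component parameters with the observed ones.

    Both arguments map a component name to a dict of its parameters.
    """
    drift = {"missing": [], "undocumented": [], "changed": {}}

    for name, params in documented.items():
        if name not in observed:
            # Documented, but not found in the running deployment.
            drift["missing"].append(name)
            continue
        actual = observed[name]
        # Record (documented, observed) for every parameter that differs.
        diffs = {
            key: (params.get(key), actual.get(key))
            for key in sorted(set(params) | set(actual))
            if params.get(key) != actual.get(key)
        }
        if diffs:
            drift["changed"][name] = diffs

    # Running in the deployment, but absent from the documentation.
    drift["undocumented"] = sorted(set(observed) - set(documented))
    drift["missing"].sort()
    return drift
```

An empty result in all three categories means the map still matches the deployment; anything else is a change that was not scrutinised, announced, or recorded.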
- Vulnerability management
Because we’re discussing open source app ecosystems in public cloud deployments, vulnerability management can be a delicate process. The developer of each app will likely have structurally similar, but logistically mismatching protocols for the identification, triage, and mitigation of vulnerabilities for their products. This means that, should you manage your operations yourself, you will have to follow several calendars and observe varying timescales when it comes to vulnerability mitigations.
Mitigations can come in the form of updates, downgrades, or patches. But before applying anything, you must ensure that you yourself have a good system in place to identify and flag vulnerabilities, and either report them to app developers or fix them in-house. You can automate mitigations to a level, but you will require expertise in order to both develop and maintain this automation.
Note: Using a portfolio of apps developed or maintained by a single company can make vulnerability management much easier and create cohesion within your ecosystem. Canonical is an example of such a company, with an extensive portfolio of highly interoperable applications.
- Monitoring
Even the most successful and carefully deployed environments have incidents. An incident could be an external attack, or it could just be things going wrong within the amalgamation of components – whatever the situation, you need a detailed and highly performant monitoring system in place to tell you when it happens.
In essence, monitoring is a set of practices and procedures that ‘keep an eye’ on your systems and ensure that everything does only what it’s supposed to do. Efficient monitoring will also help you keep track of who does what on your clusters, and even what happens during successful or unsuccessful incidents. The benefits of monitoring are endless, and resemble those of documenting anything else: you have a chance to look back on the progression of your system, and also offer any interested stakeholder the ability to evaluate what is happening (or has been happening) at any given time on your infrastructure.
A more stringent form of monitoring is the security audit, which entails a trusted third party performing a deep-dive into your systems and their security protocols. The third party can be either a separate team within your company (in the case of larger enterprises) or a completely different entity. Audits are the best way to ensure that your security protocols are compliant and do not fall prey to human bias.
While there are plenty of things that can be done to automate monitoring, it is one of the security components that does, at this point at least, still require some manual operation. It will, therefore, consume a significant amount of operational resources, and cannot be overlooked.
- Incident and Recovery
As I mentioned above, even the most successful deployments have incidents. You cannot fully control this: things can go wrong at any time. You may think of the worst case scenarios: security breaches, horrible errors, or your entire system going down. But even losing half of an application’s core functionality because of an incorrect configuration or update can cause significant disruption to your business processes, and ultimately to your profitability. This is why it is essential to establish and enforce a set of predefined incident management processes.
An incident management process refers to some variation of a handbook with hypothetical scenarios and predefined actions to be done in order to address them. To bring in some coding lingo, it is an extensive collection of ‘if’ statements that assume the worst; this ensures that no time is wasted on decision-making and implementation of security measures when an incident happens, which decreases the time required to restore your systems and augments the efficiency of your team during crisis situations.
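Taken literally, the ‘collection of if statements’ can be sketched as a lookup from incident type to predefined steps. The incident names and steps below are invented for illustration and are not drawn from any standard; a real handbook would be far more detailed.

```python
from typing import Dict, List

# Hypothetical incident types mapped to predefined response steps.
RUNBOOK: Dict[str, List[str]] = {
    "unauthorised_access": [
        "Revoke the affected credentials",
        "Isolate the affected component",
        "Notify the security lead",
    ],
    "failed_update": [
        "Roll back to the last known good version",
        "Restore data from the most recent backup if needed",
    ],
    "service_down": [
        "Fail over to the standby instance",
        "Notify affected customers",
    ],
}

# Fallback for anything the handbook did not anticipate.
DEFAULT_STEPS: List[str] = [
    "Open an incident record",
    "Escalate to the on-call engineer",
]


def steps_for(incident_type: str) -> List[str]:
    """Return the predefined steps for an incident, or the fallback steps."""
    return list(RUNBOOK.get(incident_type, DEFAULT_STEPS))
```

Because the decisions are made in advance, responding to an incident becomes a matter of looking up `steps_for("failed_update")` and executing the steps, rather than debating what to do while systems are down.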
Bear in mind that your own team and your customers need to benefit from an incident management process handbook. You want to make sure that every stakeholder that is in any way involved in your public cloud deployment has the option to report an incident and, if necessary, fix it.
Currently, the International Organisation for Standardisation (ISO) proposes several standardised incident management processes that support companies in remaining compliant with local and global legislation. To achieve compliance with ISO standards, you need to dedicate substantial engineering resources and tailor both the configuration and maintenance of your clusters to perfection. In many situations, you will have no choice, as adherence to standards is the only way to legally operate within certain industries like healthcare, telecommunications, or national security. It is useful to note that even without ISO certification, developing and following internal incident and recovery protocols is an essential security practice.
All in all, operational security can be an arduous task that will eat your resources and energy. It’s also mandatory, regardless of the scope and size of your organisation: you simply cannot operate with a non-secure environment. Otherwise, you risk not only being attacked and losing competitive advantage, but perhaps even receiving heavy fines from global legislators – or even losing the right to operate in various industries. Consequently, you must ensure that you dedicate utmost attention to your operational security.
Patching
Patching, or patch management, refers to the distribution and application of local updates, or in other words updates dedicated to individual servers or clusters. These updates are often relatively urgent, as they are essential to the correction of security vulnerabilities or performance errors.
Operationally speaking, several actions are required in order to maintain proper patching practices. Firstly, you must know as soon as possible when patches are available, and what your vulnerabilities are. It is also essential to have full visibility of your infrastructure, so you can know which endpoints are vulnerable and need patching. It is common practice for administrators to maintain reports on which patches were applied and where, as well as for which bugs. As your company grows, it would be wise to consider the adoption of standardised patching protocols as an internal rule, so that any new member of your team can join in with full transparency on how you patch.
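A patch report of the kind described above can be as simple as a list of records stating which patch was applied where, and for which bug. The sketch below assumes a hypothetical `PatchRecord` structure and made-up identifiers; it shows how such a report also answers which endpoints still need a given patch.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PatchRecord:
    """One applied patch (hypothetical fields and identifiers)."""
    patch_id: str  # identifier of the patch
    host: str      # server or cluster it was applied to
    fixes: str     # bug or vulnerability the patch addresses


def unpatched_hosts(
    all_hosts: Iterable[str],
    records: Iterable[PatchRecord],
    patch_id: str,
) -> List[str]:
    """Return the hosts that have no record of the given patch."""
    patched = {r.host for r in records if r.patch_id == patch_id}
    return sorted(set(all_hosts) - patched)


def patches_on(records: Iterable[PatchRecord], host: str) -> List[str]:
    """Return the patch identifiers recorded for one host."""
    return sorted(r.patch_id for r in records if r.host == host)
```

The first function depends on a complete list of hosts, which is why full visibility of your infrastructure comes before patching itself.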
Patching takes time, and each patch must be scrutinised. Generally, you will be able to automate a lot of the processes, but you will still need engineering wisdom and time in order to ensure both that the automation runs correctly and that the applied patches fit well within your larger public or multi-cloud ecosystem.
What else?
I know the above sounds like a lot – and the truth is, for a lot of companies, it is. Especially in cases where your primary focus is not related to software or hardware, you will likely want to focus on what really matters to you. Undoubtedly, it can be daunting to know that in order to stay competitive, you will almost certainly have to go digital and choose some level of cloud infrastructure. This will inevitably mean that you’ll be required to operate it to the standards mentioned above. However, there is a solution that many companies select every day to avoid this unnecessary headache: outsourcing.
With regard to public cloud infrastructure, outsourcing your operations involves starting a collaboration with a managed service provider (also known as an MSP) who will handle everything for you according to predefined contract terms. This can have many benefits, from giving you more predictability over your costs to liberating you to focus on your primary business scope.
In the next part of this blog post, we will look more in depth at the benefits and key considerations you should know before, during, and after you’ve selected an MSP for your public cloud deployments. If you are eager to find out more, get in touch with our team using this link.