Capacity operations is the science and supporting processes of ensuring that applications have the necessary infrastructure to meet their requirements, without having too much. This emerging discipline continuously optimizes compute resources in cloud and container environments via recommendations and directives that are generated from deep analytics. It fills a management gap that has developed between established DevOps and FinOps processes—where analysis of the true resource requirements of applications isn’t performed by either practice—causing inflated bills and unnecessary operational risk.
Many organizations have focused cloud optimization initiatives on optimizing their purchasing and on properly quantifying and allocating costs, which is the central focus of FinOps. In doing so, many have lost focus on the actual resources being used (traditionally the focus of capacity management), leaving resource specification to application developers and creating a gap in the capacity management process. This gap is amplified by the adoption of containerized infrastructure and Kubernetes, where the specification of resources is much more granular, and getting it wrong at the micro level has a large impact on efficiency and risk in aggregate. The business sees the result as rising cloud costs and subpar service quality.
Capacity operations exists to fill this gap, providing a discipline to maximize the efficiency, availability, and speed of management of cloud and container environments by optimizing the actual resources being used, not just how they are being purchased, thus returning organizations full circle to a more disciplined approach to capacity.
Many organizations have significant untapped optimization potential, and even going after just the “low-hanging fruit” can justify capacity operations many times over. These organizations have typically adopted technology that allows them to make sense of their cloud bills, and have even achieved a level of cost savings through the use of Reserved Instances, Savings Plans, and Committed Use Discounts. But they have hit a wall when it comes to optimizing the actual resources being used, and they quickly realize that managing the bill isn’t the same as managing capacity. Adding to this, the deployment of containers typically introduces far more entities to size, at a scale that is impossible to optimize manually.
Capacity operations sets a high bar for Development and Ops teams, but ultimately saves them time, eliminates concerns, speeds delivery, and creates meaningful business benefits:
Capacity operations plugs the financial and risk management holes left by the often myopic focus on cloud financials and requires the participation of some of the same experts responsible for DevOps. Ownership of the process and required tooling will vary depending on the size and maturity of automated processes within your organization, but important stakeholders typically include:
Cloud infrastructure enables you to deploy production resources in seconds through API calls or infrastructure-as-code solutions, in an elastic micropurchasing model that eliminates the need for many traditional capacity management activities. But this doesn’t mean capacity can be ignored.
Cloud demands a completely different set of resource optimization activities with new fundamental assumptions. For example, even taking inventory is now very different than in on-prem environments, more closely resembling a “stock chart” of ups and downs than a static count. Even traditional capacity operations like rightsizing VMs now require a different approach that is subject to decentralized decision-making. The micropurchasing model enables agility, but also forces resourcing decisions into the hands of engineers and developers who often do not have sufficient information to make the right choice. Suboptimal resource micropurchases, in aggregate, can amount to significant costs and tremendous inefficiencies.
There is a set of fundamental capacity optimization practices that must be performed to make sure that the right cloud resources are deployed at any point in time, including:
Capacity operations provides a formalized practice to ensure that these operations are performed and happen in a precise and continuous manner. Many organizations have focused on the cloud bill and have achieved a high level of maturity with respect to how resources are purchased. Capacity operations adds to this by also optimizing what resources are being purchased.
Containers are even more dynamic and granular, with orders of magnitude more entities that must be optimized. Containers can be combined into pods, replica sets, deployments, and other structures, which can be launched with common manifests, and all of these structures can be governed by various quotas to control resource usage. Containers have undeniable benefits when it comes to the flexibility and agility they provide, but suboptimal resource specifications create tremendous inefficiencies at scale, leaving resources stranded and node utilization very low.
Containers do not overcommit resources in the same way VMs do, meaning that cluster administrators cannot simply tune environments to get higher density. Resources assigned to containers are actual resources, meaning they cannot be given out to multiple consumers at the same time. This removes a key weapon in the battle against inefficiency, and any resource overspecification translates directly into the need for more infrastructure, directly impacting cost.
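The effect of overspecified requests on infrastructure can be illustrated with some simple arithmetic. The sketch below is only an approximation: it assumes scheduling is constrained by CPU requests alone and ignores memory, fragmentation, and system overhead. The fleet size, request values, and node capacity are hypothetical numbers chosen for illustration.

```python
import math


def nodes_needed(cpu_requests_millicores, node_allocatable_millicores):
    """Lower bound on node count when scheduling by CPU requests alone."""
    total = sum(cpu_requests_millicores)
    return math.ceil(total / node_allocatable_millicores)


# Hypothetical fleet: 200 containers, each requesting 1000m of CPU
# while needing only about 250m at peak.
requested = [1000] * 200
rightsized = [250] * 200
node_cpu = 16000  # assumed allocatable CPU per node, in millicores

nodes_before = nodes_needed(requested, node_cpu)
nodes_after = nodes_needed(rightsized, node_cpu)
print(nodes_before, nodes_after)  # prints "13 4"
```

Because requested resources are reserved rather than shared, the overspecified fleet needs at least 13 nodes in this example, while the rightsized fleet needs at least 4.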
At scale, tiny inefficiencies with each container snowball, and even containers or microservices that run for a very short length of time have a meaningful impact. To combat this, capacity operations addresses the following:
FinOps (cloud financial operations) is directed at providing transparency and insight into the costs associated with infrastructure spend. This includes a broad array of operations related to managing the financial side of cloud consumption, from chargeback to taxation to optimizing discounts. Although optimizing capacity has a significant impact on costs, many FinOps teams do not have the bandwidth or specialization to optimize the actual resource consumption, ensure the optimal selection of infrastructure, configure optimal application scaling parameters, etc.
While cost efficiency is a benefit of both FinOps and capacity operations, the two disciplines approach this goal from very different directions, and capacity operations is specifically designed to address the range of resource optimization strategies that are possible in cloud and container infrastructure, including spending more if necessary. Capacity operations enables developers with an API for optimization so they can include proper sizing as part of their automation pipelines.
DevOps is generally focused on delivering applications and services at high velocity, and provides a revolutionary paradigm for dynamic, continuous application development, delivery, and operation. But the optimization of resources being used by these applications is typically not within the scope of the DevOps process, and continuous deployment rarely includes continuous optimization. This is because the resources required by the application components are often defined “upstream” by the developers or engineers working with the toolchain (via Terraform, Helm, etc.), and these people often don’t have access to insights or analytics that can help them make this decision.
Capacity operations provides this transparency, giving precise optimization recommendations that can be used by DevOps teams to guide their decisions, and that can even be embedded in their templates to provide optimization automatically.
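As a rough illustration of embedding a recommendation in a template, the sketch below merges a sizing recommendation into a Helm-style values dictionary before deployment. The recommendation field names (`cpu_request_m`, `memory_request_mi`, `memory_limit_mi`) and the layout of `values` are hypothetical and do not come from any particular product; only the `resources.requests` and `resources.limits` structure follows the Kubernetes container spec.

```python
import copy


def apply_recommendation(values, container, recommendation):
    """Return a copy of `values` with the container's resources replaced."""
    updated = copy.deepcopy(values)  # leave the caller's template untouched
    updated.setdefault(container, {})["resources"] = {
        "requests": {
            "cpu": f"{recommendation['cpu_request_m']}m",
            "memory": f"{recommendation['memory_request_mi']}Mi",
        },
        "limits": {"memory": f"{recommendation['memory_limit_mi']}Mi"},
    }
    return updated


# Hypothetical template values written "upstream" by a developer.
values = {
    "api": {
        "image": "example/api:1.0",
        "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
    }
}
# Hypothetical recommendation, as it might be returned by an optimization API.
rec = {"cpu_request_m": 250, "memory_request_mi": 512, "memory_limit_mi": 768}

new_values = apply_recommendation(values, "api", rec)
```

A pipeline step along these lines would let the developer-authored defaults stay in source control while the deployed sizes track the latest analysis.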
There are several main differences:
A capacity operations system will include:
When delivered in a SaaS model, this enables integration into virtually any ecosystem, leveraging existing data sources, management systems and automation frameworks in order to provide capacity optimization. A capacity operations system should be minimally intrusive, and should only require read-only credentials to get up and running.
In order to construct a model of a cloud or container environment, a capacity operations system should be capable of acquiring resource-level data from cloud APIs (e.g. AWS CloudWatch or Azure Resource Manager), billing data (AWS CUR, Azure cost information), container data (e.g. Prometheus), and any associated tagging or metadata. This enables the analytics to build a complete model of the supply and demand in the target environments and generate precise, actionable answers.
Core to the analysis process is deciphering usage patterns and relating them to resource requirements, which requires machine learning. By learning the patterns of activity and then performing deep analysis of the workload patterns against cloud catalogs and container resource models, precise recommendations can be generated.
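To make the idea of matching workload patterns against a cloud catalog concrete, here is a deliberately simplified sketch. It substitutes a plain percentile-plus-headroom heuristic for the machine learning described above, so it should be read as a stand-in and not as the actual analysis. The catalog entries, prices, sample values, and default thresholds are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_cost: float


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cheapest_fit(catalog, cpu_samples, mem_samples, pct=95, headroom=1.2):
    """Cheapest instance covering the pct-th percentile of usage plus headroom."""
    need_cpu = percentile(cpu_samples, pct) * headroom
    need_mem = percentile(mem_samples, pct) * headroom
    candidates = [
        i for i in catalog if i.vcpus >= need_cpu and i.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda i: i.hourly_cost) if candidates else None


# Hypothetical catalog and observed usage (vCPUs used, GiB used).
catalog = [
    InstanceType("small", 1, 2.0, 0.02),
    InstanceType("medium", 2, 4.0, 0.04),
    InstanceType("large", 4, 16.0, 0.16),
]
cpu_used = [0.5, 0.8, 1.1, 1.4, 1.2, 0.9, 1.3, 1.0, 0.7, 1.5]
mem_used = [2.0, 2.5, 3.0, 2.8, 2.2, 2.6, 2.9, 3.1, 2.4, 2.7]

choice = cheapest_fit(catalog, cpu_used, mem_used)
print(choice.name)  # prints "medium"
```

A real system would also account for time-of-day patterns, burst behavior, and instance family constraints, which a single percentile cannot capture.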
Capacity operations is designed to align resources with the real-time demands of applications and business services according to their SLAs. This includes matching an application’s demand profile (e.g. CPU-intensive, nighttime batch processing, etc.) with the available supply and optimal scaling characteristics (e.g. scaling min/max, requests/limits, etc.). To do this, capacity operations generates numerous types of recommendations, including:
All of this is driven by detailed policies that reflect the requirements, desired strategy, criticality, etc. for a given set of workloads.
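One way to picture policy-driven recommendations is a table that maps workload criticality to sizing headroom. The sketch below is an assumption-laden illustration: the policy names, the percentage values, and the `recommend_cpu` helper are hypothetical, and real policies would cover many more dimensions than CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    request_pct: int  # request as a percentage of typical usage
    limit_pct: int    # limit as a percentage of peak usage


# Hypothetical policy table keyed by workload criticality.
POLICIES = {
    "critical": Policy(request_pct=150, limit_pct=200),
    "standard": Policy(request_pct=120, limit_pct=150),
    "batch": Policy(request_pct=100, limit_pct=120),
}


def scale_up(value, pct):
    """Scale an integer by a percentage, rounding up, using integer math."""
    return (value * pct + 99) // 100


def recommend_cpu(typical_m, peak_m, criticality):
    """CPU request and limit (millicores) for a workload under its policy."""
    policy = POLICIES[criticality]
    request = scale_up(typical_m, policy.request_pct)
    limit = max(request, scale_up(peak_m, policy.limit_pct))
    return {"request_m": request, "limit_m": limit}


print(recommend_cpu(200, 400, "critical"))  # {'request_m': 300, 'limit_m': 800}
print(recommend_cpu(200, 400, "batch"))     # {'request_m': 200, 'limit_m': 480}
```

The same observed usage yields different recommendations depending on the policy, which is the point: the analytics supply the measurements, and the policy encodes how much risk a given workload is allowed to carry.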
A capacity operations system produces two types of output:
Both of these forms of output are API accessible, and both can be used together, enabling approval-based automation workflows.
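An approval-based workflow of this kind might look like the following sketch. It assumes, borrowing the terms from the opening paragraph, that the two outputs are human-reviewable recommendations and machine-readable directives; the record fields and function names are hypothetical and are not taken from any specific product API.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    workload: str
    current_cpu_m: int
    recommended_cpu_m: int
    approved: bool = False


def approve(recommendations, workload_names):
    """Mark the named workloads' recommendations as approved."""
    for rec in recommendations:
        if rec.workload in workload_names:
            rec.approved = True


def directives_for(recommendations):
    """Emit machine-readable settings only for approved recommendations."""
    return {
        rec.workload: {"cpu_request_m": rec.recommended_cpu_m}
        for rec in recommendations
        if rec.approved
    }


# Hypothetical recommendations awaiting review.
recs = [
    Recommendation("checkout", current_cpu_m=2000, recommended_cpu_m=500),
    Recommendation("search", current_cpu_m=1000, recommended_cpu_m=750),
]
approve(recs, {"checkout"})
print(directives_for(recs))  # {'checkout': {'cpu_request_m': 500}}
```

Only the approved change reaches automation; the unapproved one stays visible to reviewers until someone signs off.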
Although automation is typically a long-term goal, automation is not initially required to achieve significant business benefit. Many organizations start by simply ranking the efficiency of different areas of their organization, such as lines of business or app teams, providing an incentive to make improvements based on the capacity operations analytics. Combining this with detailed impact analysis reports and intelligent prioritization, tangible benefits can be achieved without automation. When policies have been tuned and a high level of confidence in the capacity operations technology has been achieved, then the next logical step is to enable automation, accelerating the business benefit and reducing the involvement of stakeholders.
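The ranking step described above needs very little machinery. A minimal sketch, assuming efficiency is measured as used resources divided by allocated resources (one of several possible definitions) and using made-up team names and figures:

```python
def efficiency_ranking(usage_by_team):
    """Rank teams by used/allocated ratio, least efficient first.

    `usage_by_team` maps a team name to a (used, allocated) pair in any
    consistent unit, such as vCPU-hours. Teams with no allocation are skipped.
    """
    scores = {
        team: used / allocated
        for team, (used, allocated) in usage_by_team.items()
        if allocated > 0
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical monthly figures in vCPU-hours: (used, allocated).
usage = {
    "payments": (120, 400),
    "search": (300, 500),
    "analytics": (90, 100),
}

ranking = efficiency_ranking(usage)
for team, score in ranking:
    print(f"{team}: {score:.0%}")
```

Publishing a list like this, with the least efficient groups at the top, gives teams a concrete target before any automation is switched on.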
Manual capacity optimization will not achieve the full benefits intended by capacity operations. You must eventually automate the decisioning and adjustment of component sizes to overcome the barriers presented by suboptimal resource allocation at scale.