Surviving the Server Chip Shortage

February 14, 2022

Everyone is Struggling with Lengthy Lead Times for Server Procurement

The global chip shortage, which began in 2020, persists as demand for semiconductor chips continues to far outpace production. Intel CEO Pat Gelsinger recently forecast that shortages will last through at least the remainder of 2022.

As a result, IT operations teams at almost every company we’ve talked with have felt the crunch in the form of skyrocketing prices and delays of up to a year for procurement of physical servers. Service delivery teams are now faced with exceedingly difficult planning and budgeting scenarios. Getting approval for these large capital expenses was difficult enough when lead times were three months.

Weighing Options for Minimizing the Impact of the Server Crunch

There are a few responses that we hear often:

  • Move more services to the AWS, Azure, or Google public clouds
  • Place orders well in advance based on speculation you will need more servers
  • Extend the life of existing servers by delaying your normal lifecycle refreshes
  • Outsource to third-party service providers (whenever they can actually commit to supply)

Another solution, which we will explore in depth, is to improve the efficiency of your existing servers so that they can accommodate your internal demand and growth.

In practice, existing applications are commonly allocated far more resources than they require. Across all environments, we find that approximately 45% of all workloads are over-resourced with vCPU, memory, and storage. If you could identify those specific workloads and reclaim that unused capacity, you could reassign it to provision additional applications onto your existing server pool.

Most organizations tackle this at the server level first, and then optimize at the virtual machine (VM) level.

Server Optimization

For each cluster, calculate the number of VMs (based on average size) that the cluster can accommodate, and compare it with the number of VMs (based on the same average size) actually running in that cluster. The difference tells you how many servers can be decommissioned. You need to look at every resource (vCPU, memory, and disk), and you can’t combine vCPU from one host with memory from another, so it is best to rely on a proven analysis product to calculate this for you.
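As a rough illustration of the per-cluster arithmetic, the sketch below sizes a cluster by its scarcest resource. It is a simplification, not the method an analysis product would use: it assumes identical hosts, a single average VM size, and host figures that already reflect any overcommit policy. The `Resources` type, the function names, and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Resources:
    vcpu: float
    memory_gb: float
    disk_gb: float


def vms_per_host(host: Resources, avg_vm: Resources) -> int:
    """Whole average-size VMs that fit on one host.

    The scarcest resource sets the limit, because spare vCPU on one host
    cannot be paired with spare memory on another.
    """
    return int(min(
        host.vcpu // avg_vm.vcpu,
        host.memory_gb // avg_vm.memory_gb,
        host.disk_gb // avg_vm.disk_gb,
    ))


def required_hosts(vm_count: int, host: Resources, avg_vm: Resources) -> int:
    """Hosts needed to run vm_count average-size VMs."""
    per_host = vms_per_host(host, avg_vm)
    if per_host == 0:
        raise ValueError("an average-size VM does not fit on a single host")
    return math.ceil(vm_count / per_host)


# Hypothetical figures, for illustration only.
host = Resources(vcpu=64, memory_gb=512, disk_gb=8000)  # allocatable per host
avg_vm = Resources(vcpu=4, memory_gb=24, disk_gb=300)

print(vms_per_host(host, avg_vm))         # 16 (vCPU is the binding resource)
print(required_hosts(119, host, avg_vm))  # 8
```

Real clusters mix host models and VM sizes, which is one reason the per-host constraint is hard to evaluate by hand.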

Below is a simple example of a report that identifies the number of surplus hosts for a set of clusters.

Environment capacity report showing the number of hosts in surplus or shortfall within each cluster

Cluster       Hosts  VMs  HA Hosts  In Maintenance  Required Hosts  Surplus/Shortfall
E1-POD-L01       17  119         1               0               9                  7
E2-SMI            6   54         1               0               5                  0
IPC-1-SMI         3   27         1               0               3                 -1
IPC-2-SMI         6   67         1               0               4                  1
IPC-POD-A         9   61         1               1               6                  2
E3-POD-NW-04     17   73         1               1               9                  7
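Every row in the report above is consistent with a simple relationship: surplus or shortfall equals total hosts minus hosts reserved for HA minus required hosts, with the maintenance-mode column not subtracted. That relationship is inferred from the sample figures rather than taken from a documented formula, so treat the sketch below as a sanity check. The function name is hypothetical.

```python
# Rows copied from the report above:
# (cluster, hosts, VMs, HA hosts, hosts in maintenance mode,
#  required hosts, reported surplus/shortfall)
REPORT = [
    ("E1-POD-L01", 17, 119, 1, 0, 9, 7),
    ("E2-SMI", 6, 54, 1, 0, 5, 0),
    ("IPC-1-SMI", 3, 27, 1, 0, 3, -1),
    ("IPC-2-SMI", 6, 67, 1, 0, 4, 1),
    ("IPC-POD-A", 9, 61, 1, 1, 6, 2),
    ("E3-POD-NW-04", 17, 73, 1, 1, 9, 7),
]


def surplus_hosts(hosts: int, ha_hosts: int, required: int) -> int:
    """Positive result: hosts in surplus. Negative result: shortfall."""
    return hosts - ha_hosts - required


for cluster, hosts, _vms, ha, _maint, required, reported in REPORT:
    computed = surplus_hosts(hosts, ha, required)
    status = "matches" if computed == reported else "differs from"
    print(f"{cluster}: computed {computed:+d} {status} reported {reported:+d}")

# Hosts that could be freed across the clusters that have a surplus.
print(sum(max(0, surplus_hosts(h, ha, req)) for _, h, _, ha, _, req, _ in REPORT))  # 17
```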

VM Optimization

Next you can look at each VM to determine if it has excess capacity that can be reclaimed and re-allocated.

There are a few key steps to freeing up the unused capacity sitting in your data center:

Step 1: Identify workloads that are using far fewer resources than allocated
Step 2: Determine the number of resources that can be reclaimed from each workload
Step 3: Reclaim those specified resources
Step 4: Assign reclaimed resources to new builds or growth workloads

Step 1: Identify Workloads That Are Using Far Fewer Resources than Allocated

We recommend that you set thresholds for each resource, so that you can systematically analyze each workload to determine which ones have been allocated excess resources.

Typical resource threshold values are:

  • 30% or less vCPU utilization
  • 40% or less memory utilization
  • 35% or less storage utilization
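A minimal sketch of how thresholds like these could be applied systematically is shown below. The `Workload` type, the workload names, and the utilization figures are hypothetical. Utilization is expressed as a fraction of the allocated amount, and a resource with no measurement is never flagged.

```python
from dataclasses import dataclass

# Low-utilization thresholds from the list above, as fractions of allocation.
LOW_THRESHOLDS = {"vcpu": 0.30, "memory": 0.40, "storage": 0.35}


@dataclass
class Workload:
    name: str
    utilization: dict  # resource name -> utilization as a fraction (0.0 to 1.0)


def over_resourced(workload: Workload, thresholds: dict = LOW_THRESHOLDS) -> list:
    """Return the resources whose utilization is at or below the threshold."""
    return [
        resource
        for resource, limit in thresholds.items()
        # A missing measurement defaults to 1.0, so it is never flagged.
        if workload.utilization.get(resource, 1.0) <= limit
    ]


# Hypothetical measurements, for illustration only.
workloads = [
    Workload("web-01", {"vcpu": 0.12, "memory": 0.55, "storage": 0.20}),
    Workload("db-01", {"vcpu": 0.72, "memory": 0.81, "storage": 0.60}),
]

for w in workloads:
    print(w.name, over_resourced(w))
# web-01 ['vcpu', 'storage']
# db-01 []
```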

Next, you need to consider the precise metric you want to measure: is it average utilization, sustained utilization, or peak utilization? And do you look at the 90th percentile, the 95th, or perhaps the 98th?

Finally, determine the appropriate period for analysis (the past week, month, or quarter) and, similarly, whether to evaluate the busiest day in that period or an average across it.

We also recommend you consider using different thresholds depending on whether you are looking at a dev, test, or production environment. We do not recommend looking at averages or low percentiles, as these often mask critical requirements for the application.
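To see why the choice of metric matters, the sketch below compares the average, several percentiles, and the peak for one set of utilization samples. The samples are hypothetical, and the nearest-rank percentile used here is only one of several common percentile definitions.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile; pct is a whole number from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n) in integer math
    return ordered[rank - 1]


# Hypothetical vCPU utilization samples (%): mostly idle with two short bursts.
cpu_samples = [10] * 18 + [60, 90]

print(statistics.mean(cpu_samples))  # 16.5
print(percentile(cpu_samples, 90))   # 10
print(percentile(cpu_samples, 95))   # 60
print(percentile(cpu_samples, 98))   # 90
print(max(cpu_samples))              # 90
```

Judged on its average (16.5%) or its 90th percentile (10%), this workload falls under a 30% vCPU threshold. Its 95th percentile (60%) does not, which is the kind of requirement that averages and low percentiles can mask.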

At Densify, we provide a set of policies to help you configure each of these parameters so that the analysis and recommendations are based on your specific organizational requirements.

Step 2: Determine the Number of Resources That Can Be Reclaimed from Each Workload

Ideally, you also set high thresholds so that this number can be calculated precisely. High thresholds of 80% vCPU utilization and 90% memory utilization are common.

Once these are set, you can calculate exactly how much of each resource can be reclaimed while keeping the workload between your low and high threshold settings. We have found that a graphical view of the before and after resource consumption measurements goes a long way in making an application owner comfortable with the recommendation.
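One way to turn a high threshold into a reclaim figure is sketched below: size the workload so that its measured peak demand sits at or just under the high threshold, then reclaim the remainder. This is an illustrative calculation under that assumption, not a description of any product's algorithm. The function name and the example workload are hypothetical.

```python
import math


def right_size(allocated, peak_utilization, high_threshold):
    """Return (target_allocation, reclaimable) in whole units.

    allocated        -- units currently assigned (vCPUs, GB of memory, ...)
    peak_utilization -- measured utilization as a fraction of allocated (0 to 1)
    high_threshold   -- highest acceptable utilization after the change (0 to 1)
    """
    if not 0 < high_threshold <= 1:
        raise ValueError("high_threshold must be greater than 0 and at most 1")
    used = allocated * peak_utilization
    # Smallest whole allocation that keeps utilization at or below the threshold.
    target = max(1, math.ceil(used / high_threshold))
    # A workload already above the threshold has nothing to reclaim.
    reclaimable = max(0, allocated - target)
    return target, reclaimable


# Hypothetical workload: 8 vCPU at 25% and 32 GB of memory at 35% utilization.
print(right_size(8, 0.25, 0.80))   # (3, 5)
print(right_size(32, 0.35, 0.90))  # (13, 19)
```

After the change, this workload would run at roughly 67% vCPU and 86% memory utilization, which is above the low thresholds and below the high ones.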

If you have established defined workload sizes (often referred to as t-shirt sizing), such as 2 vCPU with 4 GB of memory or 4 vCPU with 8 GB of memory, make sure your recommendations adhere to these specifications.
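Snapping a recommendation to a size catalog could look like the sketch below. The first two catalog entries come from the sizes mentioned above. The size names and the two larger entries are hypothetical additions for illustration.

```python
from typing import NamedTuple, Optional


class Size(NamedTuple):
    name: str
    vcpu: int
    memory_gb: int


# 2 vCPU / 4 GB and 4 vCPU / 8 GB are from the text; the rest is hypothetical.
CATALOG = [
    Size("S", 2, 4),
    Size("M", 4, 8),
    Size("L", 8, 16),
    Size("XL", 16, 32),
]


def smallest_fit(vcpu_needed, memory_gb_needed, catalog=CATALOG) -> Optional[Size]:
    """Return the smallest catalog size covering both requirements, or None."""
    for size in sorted(catalog, key=lambda s: (s.vcpu, s.memory_gb)):
        if size.vcpu >= vcpu_needed and size.memory_gb >= memory_gb_needed:
            return size
    return None


# A workload that needs 3 vCPU and 13 GB lands on "L": memory drives the choice.
print(smallest_fit(3, 13))  # Size(name='L', vcpu=8, memory_gb=16)
```

Because the catalog rounds up, the resources actually reclaimed can be smaller than the raw calculation suggests.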

Step 3: Reclaim Those Specified Resources

This can be a one-step or two-step process. In many organizations, the first step is a change request sent to the workload or application owner to approve the change. The second step is a change request to implement it. We see some customers streamlining this process by combining these tasks into a workflow that flows from a single change request.

Once approved, you can manually adjust the workload resource configuration during a maintenance window or use an automated process. Automated processes can be set to make changes only during a defined maintenance window, or at a specified time outside of prime business hours. The procedure itself is quick, but does require a shutdown and restart of the instance.
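An automated process needs a guard that only lets changes through inside the agreed window. A minimal sketch of such a check follows. The function name and the 22:00 to 02:00 window are hypothetical, and the sketch ignores time zones and day-of-week rules that a real scheduler would need.

```python
from datetime import datetime, time


def in_maintenance_window(moment: datetime, start: time, end: time) -> bool:
    """True if moment falls inside the daily window [start, end).

    Handles windows that cross midnight, such as 22:00 to 02:00.
    """
    t = moment.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


# Hypothetical nightly window from 22:00 to 02:00.
window_start, window_end = time(22, 0), time(2, 0)

print(in_maintenance_window(datetime(2022, 2, 14, 23, 30), window_start, window_end))  # True
print(in_maintenance_window(datetime(2022, 2, 14, 14, 0), window_start, window_end))   # False
```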

Step 4: Assign Reclaimed Resources to New Builds or Growth Workloads

This is the easiest and most rewarding step. Depending on your provisioning process, you should have immediate visibility and access to these reclaimed resources to provision new workloads.
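When estimating how many new workloads the reclaimed capacity can host, remember that vCPU freed on one host cannot be combined with memory freed on another. The sketch below counts placements host by host and contrasts the result with a naive pooled total. Host names, free capacity figures, and the VM size are all hypothetical.

```python
def placements_per_host(free, vm_vcpu, vm_memory_gb):
    """Number of VMs of the given size that fit on each host.

    free maps host name -> (free vCPU, free memory in GB).
    """
    return {
        host: min(vcpu // vm_vcpu, memory_gb // vm_memory_gb)
        for host, (vcpu, memory_gb) in free.items()
    }


# Hypothetical free capacity after reclaiming resources.
free = {"host-a": (6, 40), "host-b": (10, 12), "host-c": (1, 64)}

per_host = placements_per_host(free, vm_vcpu=4, vm_memory_gb=8)
print(per_host)                # {'host-a': 1, 'host-b': 1, 'host-c': 0}
print(sum(per_host.values()))  # 2

# Pooling the totals overstates what fits, because resources are stranded.
pooled = min(
    sum(vcpu for vcpu, _ in free.values()) // 4,
    sum(memory_gb for _, memory_gb in free.values()) // 8,
)
print(pooled)                  # 4
```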

Reclaiming Capacity at Your Organization

Before you begin the onerous task of purchasing servers this year, first determine if you really need them. Working with solution providers like Densify, you can quickly determine which servers can be decommissioned, and the amount of resources that can be reclaimed to accommodate this year’s new demand—often, these resources can be quite significant.

We’d love to chat with you to understand your situation, and give you some tips while demonstrating what we can do to help your organization free up resources. Connect with us for a short exploratory conversation and demo.