Kubernetes Resource Control

Part III: Identifying Waste

Video transcript:

The third area I want to go into is cost savings. And this goes hand in hand with risk because we see a lot of stranded capacity in these environments, but sometimes you need to make sure you fix the risks before you go after the cost savings. I’ll explain that a little more in a minute, but let me go down into the histogram view.

In the previous section, I covered how to read this histogram and what it’s saying. This is what a given environment looks like in terms of CPU requests, CPU limits, memory requests, and memory limits, and this time I want to focus on this yellow area, where we have stranded CPU. Now, we also have this big gray area, so in addition to a lot of containers that are too big, a lot of containers don’t have a request value at all. But if you look at the top right, we see that there’s an overall surplus of 33%, so there’s quite a bit of waste in this environment. The yellow ones tend to be the bigger containers; we often see a bunch of tiny ones that don’t have a request value, but some big ones out there have really big request values. I’m going to drill down on those, because that’s a source of cost savings.

When I see yellow in the histogram and I drill down, I see things like this. This one has a surplus we’re estimating at about 1 CPU, and the reason we’re saying that is because this one has a request value of 1 CPU; that’s the yellow line, but it’s doing on average 0.26 millicores of activity. It’s really doing nothing, except once a day it spikes. Usually you don’t want to size the request for that spike, because that’s going to lock up the capacity all day long and cost quite a bit of money. We’re recommending bringing the request down closer to the actual utilization. Then when the spike occurs, it’ll still get processing time because there’s plenty of space on the nodes. The nodes have lots of capacity, so it should get it when it wants it, but we don’t want that request held out all day long, because when this workload gets scheduled onto a node, it’s going to earmark a whole CPU for this one workload.
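To make that concrete, here is a minimal sketch of what applying that kind of recommendation could look like using the official Kubernetes Python client. The deployment name, namespace, container name, and the 300m target are illustrative placeholders, not values from this environment, and in practice the change might be delivered through manifests or other automation instead.

```python
# Minimal sketch (illustrative names/values): lower an oversized CPU request
# on a Deployment using the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "worker",                   # hypothetical container
                    "resources": {
                        "requests": {"cpu": "300m"},    # was "1" (a full CPU)
                    },
                }]
            }
        }
    }
}

# Strategic merge patch: only the named container's CPU request changes.
apps.patch_namespaced_deployment(name="worker", namespace="default", body=patch)
```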

And we see a lot of workloads that are very spiky with these high request values. From a policy perspective, you can do that – you can request to the peak if that’s your policy. But in aggregate, all these high request values strand a lot of capacity because, again, these nodes are nowhere near fully utilized from a CPU perspective. And you can see in the columns here we sum up the total requests and the surplus: there are 63 CPUs of surplus out of 180 that are deployed, so about a third of the environment is surplus CPU. Going through this again, you can see that it’s all over the place. There are lots of them that are too big from a CPU perspective. I’ll just click my way down here. Here’s another case where we’re requesting half a CPU, but these are not very big workloads.
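The arithmetic behind those columns is straightforward: per container, the surplus is roughly the request minus the observed utilization, rolled up across the environment. Here is a rough sketch of that calculation; the per-container numbers are made up, but the idea is the same as the 63-of-180 figure above.

```python
# Rough sketch of the surplus roll-up: requested CPU minus observed CPU,
# summed across containers. Per-container numbers below are made up.
containers = [
    # (name, cpu_request_cores, avg_cpu_used_cores)
    ("svc-a", 1.0, 0.03),
    ("svc-b", 0.5, 0.12),
    ("svc-c", 2.0, 0.40),
]

total_requested = sum(req for _, req, _ in containers)
total_surplus = sum(max(req - used, 0.0) for _, req, used in containers)

print(f"requested: {total_requested:.2f} cores")
print(f"surplus:   {total_surplus:.2f} cores "
      f"({100 * total_surplus / total_requested:.0f}% of requests)")
```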

This is a clear indication that if we were to optimize these, it would free up Kubernetes to run on fewer nodes, because it’s going to pack better; it’s going to pack more workloads into each node, that’s going to let it run fewer nodes, and the scale group will automatically get smaller. I’ll talk about that in a second.

I also want to very quickly talk about memory. There’s not a lot of yellow memory in this example environment. In the previous section, I showed a different environment that had a lot more oversized memory, but you can see the surplus memory requests here. Here’s a case where, interestingly, CPU is very well utilized but memory is not. We’re requesting 4 gigabytes and on average using 150 megabytes – so this is straight-up stranding memory. There are quite a few here with pretty big stranding. So we have an opportunity to carve back memory as well, but with the caveat that I don’t want to start doing that until I fix the red and the gray, which I described in the previous section. It may be that the surplus these ones have is helping those ones survive, because those ones aren’t asking for enough. If you have something asking for too much and it gets scheduled beside something that’s not asking for enough, one can actually help the other survive. If you start downsizing all the yellow before touching the risk, it can make the risk even worse. So there’s quite a bit of savings here.
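Here is a toy sketch of why that ordering matters. On a shared node, the memory an over-requested container reserves but never uses is what lets an under-requested neighbor burst past its own request; shrink the over-requested one first and the scheduler will pack the node tighter, removing that cushion. All numbers are hypothetical.

```python
# Toy illustration (hypothetical numbers): why downsizing the yellow
# (over-requested) containers before fixing the red/gray (under-requested)
# ones can make the risk worse.
NODE_MEMORY_GIB = 8.0

# (name, memory_request_gib, actual_use_gib)
yellow = ("batch-worker", 4.0, 0.15)   # asks for far more than it uses
red    = ("cache",        0.5, 3.0)    # asks for far less than it uses

def summarize(label, pods):
    requested = sum(r for _, r, _ in pods)
    used = sum(u for _, _, u in pods)
    print(f"{label}: {requested:.2f} GiB requested, "
          f"{used:.2f} GiB actually used of {NODE_MEMORY_GIB:.0f} GiB")

# Today the scheduler sees 4.5 GiB requested, leaves the node lightly packed,
# and the yellow pod's unused headroom quietly covers the red pod's overage.
summarize("before", [yellow, red])

# Shrink the yellow request first and the scheduler sees almost nothing
# reserved, so it packs more pods onto the node -- and real usage can now
# exceed physical memory, putting the under-requested pod at risk.
yellow_small = ("batch-worker", 0.25, 0.15)
summarize("after ", [yellow_small, red, red, red])
```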

In this case, it’s about 10% savings, but CPU is about 30%. We see 60% or 70% savings opportunity in a lot of environments. But you’ve got to follow the process of making sure you get rid of that low-hanging-fruit risk first before you start reclaiming the resources.

So any yellow in the histograms is an indicator of an opportunity to save money. And I’ll go back over to the cloud side of the house for a second just to make a point about the scale groups these things are running on. As I mentioned in a previous section, these Kubernetes node groups get linked to the scale groups and scale sets in the cloud, and they drive the optimization there.

So this one’s saying it’s optimal right now, because it’s factoring in the requests and limits that are in the environment. As I start to downsize, for example, CPU or memory, this will start giving a different recommendation. It’ll say, okay, now that you’re not asking for so much, we don’t need to keep that much on the floor, because it’ll see the aggregate request levels coming down. You can see here there’s utilization and there are the aggregate request levels, and as they come down, this will start giving you a different recommendation. It will recommend moving onto memory-optimized instances or reducing the scaling numbers, and that’s where the real savings occur.

As I pointed out earlier, here’s an example where you’re on a t3.2xlarge, and we’re saying you can go onto a smaller memory-optimized instance with different scaling parameters, and you’re going to save some money. You’ll start seeing recommendations like this as you optimize the container settings, because then Kubernetes can schedule things onto nodes better. You don’t need as many nodes anymore, and you’ll start to see savings coming out of this level of the product, which is really where the dollars come from – it’s in the cloud bill for the containers running in the cloud.
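As a back-of-the-envelope check on where those node savings come from: the scale group only needs enough nodes to cover the aggregate requests. Here is a rough sketch using the 8 vCPUs of a t3.2xlarge and the 180-requested / 63-surplus figures from above; the ~10% allowance for system overhead is an assumption, and a real scheduler also bin-packs per pod and considers memory.

```python
# Back-of-the-envelope sketch: fewer requested CPUs -> fewer nodes needed.
# t3.2xlarge has 8 vCPUs; the 10% system/kubelet reserve is an assumption,
# and real scheduling also depends on memory and per-pod bin-packing.
import math

NODE_VCPU = 8
ALLOCATABLE_VCPU = NODE_VCPU * 0.9   # assumed headroom for system daemons

def nodes_needed(total_cpu_requests: float) -> int:
    return math.ceil(total_cpu_requests / ALLOCATABLE_VCPU)

print("before rightsizing:", nodes_needed(180))              # ~25 nodes
print("after reclaiming 63 CPUs:", nodes_needed(180 - 63))   # ~17 nodes
```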

We also optimize containers running on-prem the same way. You don’t see the scale groups there; the nodes may show up as bare metal or as VMs, but the same Kubernetes optimization applies.

That’s a quick overview of the cost savings opportunity identified by Densify in these container environments.

Part I: Navigation & Visibility →

Part II: Identifying Risk →