Kubernetes Resource Control
Part II: Identifying Risk
Video transcript:
In the previous section, I walked through how Densify works and gave a bit of orientation on navigating the user interface, finding the data, and using the various drill-downs. What I want to talk about next is identifying risk, and for that I’m going to go back into the histogram view for the entire environment.
The main risks show up here: if there’s any red or gray in the memory requests, it means the request was set too low or no request was given at all. What happens is that when Kubernetes goes to schedule these workloads, it may over-stack the nodes, because it keeps placing containers on a node thinking they don’t need much memory when they’re actually going to use a lot, to the point where the node blows up.
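To make that concrete, here is a minimal, hypothetical container spec (the names, image, and numbers are illustrative and not taken from the environment in the video). The Kubernetes scheduler places pods based on the sum of their requests rather than their actual usage, so a request that understates the real working set lets too many of these land on the same node.

```yaml
# Hypothetical example: the request promises 256Mi, but the process
# actually uses far more under load. The scheduler bin-packs nodes by
# summing requests, so several pods like this can be placed on one
# node and together exceed its physical memory.
apiVersion: v1
kind: Pod
metadata:
  name: underspecified-app          # illustrative name
spec:
  containers:
  - name: app
    image: example.registry/app:1.0 # placeholder image
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"             # far below the real working set
```

When enough of these get co-scheduled, the node runs out of memory and containers start getting killed, which is the pod termination risk described above.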
So when we see red or gray here, that’s an indication that you could have out-of-memory kills, or what we call pod termination risk, and I’ll drill into that in a second. Down here is a different problem: process termination risk due to memory limits. If the limit is too low, then as soon as your working set hits that limit the container gets killed, even if it’s on a healthy node. So the red and gray up here, and the red down here, are very important to us. What I’m going to do now is drill down into the details and show you what that looks like.
Now that I’ve followed that link, the view is sorted by the worst-offending memory requests. Here you can see the CPU utilization pattern, which is the pattern model from the machine learning. In this case the CPU request is also too low: we’re requesting 200 millicores, and we’re recommending bringing it up above that sustained activity. But if I scroll to the right, the memory is way too low as well. This one is using almost 3,500 megabytes, but we’re only requesting 800 megabytes, so it’s underspecified. On its own, that might not be a problem if there are only one or two of these, but if a lot of them exist and they get scheduled together on the same node, the node is going to blow up. You’re going to run out of memory on the node.
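As a rough sketch of the kind of change being recommended here, the snippet below uses the numbers from this example (200 millicores of CPU and 800 megabytes of memory requested, with observed usage of almost 3,500 megabytes). The corrected values are illustrative only; the actual targets come from Densify’s analysis of the utilization pattern.

```yaml
# Before: requests far below the observed usage (~3,500 MB of memory).
resources:
  requests:
    cpu: "200m"
    memory: "800Mi"

# After (illustrative values only; use the analyzed recommendation):
# resources:
#   requests:
#     cpu: "500m"       # raised above the sustained CPU activity
#     memory: "3600Mi"  # raised above the observed working set
```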
So here’s a good example of a case that’s underspecified on memory. In this case it’s also in the red for CPU, so it looks like it needs more CPU; this is a straight-up upsize recommendation. And you can see there are a bunch of them in here where you’re simply not giving the workload enough resources to handle the work. Now, these might not be the ones that get killed. Kubernetes will prioritize killing the ones that are way above their limit or don’t have a limit at all, so some other poor container may end up being the victim even though these are the ones that are misspecified. If I look at the number of restarts, there’s a bunch here with a lot of restarts. Here’s a cilium-operator, let’s pick this one. It has restart activity in the last day and in the last week, and you can see from its CPU utilization profile that it doesn’t actually have a CPU request or limit, so we should give it one. If I go to the memory, you can see it’s not hitting a limit. It’s not restarting because it’s hitting a memory limit; it just doesn’t have a limit at all, nor does it have a request. That puts it at risk: because it doesn’t have a request, it might be the first one to get killed, and it does have restart activity. In fact, if I scroll to the right in these curves, you can see the restart pattern over the last 24 hours. So this one may be the victim of sitting on a node that’s running out of memory.
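For context on why a missing request makes this container an early victim: a pod whose containers set no requests or limits falls into the BestEffort QoS class, which Kubernetes treats as the lowest priority when a node runs low on memory. Below is a minimal sketch of the kind of stanza we’d add; the image reference and the values are placeholders, not the real recommendation.

```yaml
# Without a resources stanza this pod runs as BestEffort QoS, making
# it one of the first candidates to be killed under node memory
# pressure. Adding a request (and ideally a limit) sized from the
# observed utilization reduces that risk.
# (Image and values are placeholders, not the actual recommendation.)
spec:
  containers:
  - name: cilium-operator
    image: example.registry/cilium-operator:tag
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        memory: "256Mi"
```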
So when we see a case where something’s restarting, there’s potential memory pressure on the node, and it’s sitting in a node group that has misspecified containers, that’s a risk. We see that all the time, where things are getting killed. It’s not really this container’s fault; well, it doesn’t have a request, so it’s partly its fault, but the bigger issue is that there are a lot of misspecified containers in this cluster. But if I look at the memory limits, we also have red here. So if I go down into the process termination risk, what I’m going to do is again sort by restarts and look at, for example, this pme-server. Its utilization is nowhere near its CPU request; the request is way too big from a CPU perspective. This is where it gets interesting: you see this curve here, where the request and limit are both set to 96 megabytes, and it’s hitting its limit. You can see it banging into the limit when the memory goes up and taps the purple line; that means you’re hitting your memory limit, and we’re seeing a ton of restarts on this thing. If I scroll to the right, there are many restarts per hour. This is a case where the limit is simply too low. It could be running on a node group with loads of capacity, but it’s getting killed, and we’re seeing the restarts.
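For reference, here’s roughly what that misconfiguration looks like in a container spec. The 96-megabyte figures come from the example in the video; the corrected value is purely illustrative, since the real target should come from the analyzed working set.

```yaml
# Limit-too-low case: request and limit are both 96Mi, so as soon as
# the working set reaches 96Mi the container is OOM-killed and
# restarted, no matter how much free memory the node has.
resources:
  requests:
    memory: "96Mi"
  limits:
    memory: "96Mi"

# Fix (illustrative value only): raise the limit above the observed
# working set, based on the analyzed utilization pattern.
# limits:
#   memory: "256Mi"
```

In a live cluster, this pattern typically shows up as containers whose last termination reason is OOMKilled, alongside a climbing restart count.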
Now, even worse can be the case where you don’t get restarts. If we have this memory pressure, where you’re hitting your limit but there are no restarts, that’s indicative of what we call a hidden kill: the container is hitting its limit, but a process is getting killed inside the container, and it’s not process ID number one. So the container keeps running, it’s just missing a process, and all kinds of weird things happen. We see this out there as well, and because it doesn’t show up in restarts or in any of the Kubernetes metrics, it’s not on your radar. It’s just causing all kinds of havoc with the application.
So there are two kinds of memory problems we see: one where the node runs out of memory, and one where the container hits its own limit. Both of them are pretty bad, and we see both extensively across our customers.