Kubernetes Resource Control
Part I: Navigation and Visibility
See the value of AI-driven analytics
Video transcript:
Hi, I'm Andrew Hillier, the CTO and Co-Founder of Densify, and today I want to walk through our Kubernetes Resource Control. I'm going to do that in three sections. First, I'm going to talk generally about how Densify works: how you navigate the user interface, the kinds of data we acquire, and the kinds of answers we generate. I'll also get into some deeper-level workload metrics viewing and talk a bit about observability and how all of this relates to and augments your observability solutions. Then I want to get into identifying risks in the Kubernetes environment. We think it's very important to understand whether any containers are at risk, especially of being killed due to memory issues, so I'm going to cover that in some detail. And then I'm going to talk about waste and cost savings, and how to identify savings opportunities and stranded capacity in a Kubernetes environment.
Let's start with the first section. First of all, here's a high-level view of what Densify looks like. Right now you're looking at the top-level screen for Public Cloud. The reason I'm showing you that is that we also analyze and optimize all the major cloud providers as well as Kubernetes, which I'm going to cover today. But Kubernetes often runs in these cloud providers, and the scale groups in this demo, for example, are where the Kubernetes node groups are running, and that's part of our analysis. I'll come back to this.
We analyze the cloud and the containers all together to give seamless end-to-end optimization. But right now, I'm not going to talk about cloud.
I'm going to go over to the container area and into our Kubernetes main page. What you're looking at here is a high-level dashboard of the environment that's in this system. Now, a caveat: this is a lab system, so not all of these workloads are real. I'm going to go through that, and I'll show you some real workloads in a minute.
But first of all, just to orient ourselves, we can filter down across the top here on different parts of the environment. So I can look at just certain clusters or namespaces or controller types, and I can add filters; we can ingest labels and tags to give all kinds of filtering, and I can do that anywhere in the UI. So I'll just give that a bit of coverage up front here. I'm in a summary page. You see, this environment has about 1,700 containers running in 1,500 pods. Many of the pods are one container per pod, and they're being started by 1,600 different launch manifests. That's important because the manifest is the point where we optimize these things: that's where you change the numbers, and that's how you optimize the environment. Many of them are singletons, some of them are running multiple containers in a pod, and some of them are replicated, and I'll show you that as I drill down.
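As a side note on that filtering, the labels and tags being ingested are ordinary Kubernetes metadata on the workloads. Here is a minimal, purely illustrative sketch of the kind of labels that typically feed those filters and group-bys; the keys and values are hypothetical examples, not required names:

```yaml
# Illustrative only: labels like these can be ingested and used to filter
# or group the environment (by business service, department, and so on).
metadata:
  labels:
    app.kubernetes.io/name: payments-api   # hypothetical application name
    business-service: payments             # hypothetical business-service tag
    department: finance                    # hypothetical department tag
```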
We have a summary of the optimization opportunity. I’ll get more into this. In this case, about 35% of the environment is good from a request and limit perspective. So when we look at whether the containers are specified properly and the resources are specified properly, about a third of the environment is good. We have some that need to be made bigger, some that need to be made smaller.
The pink resize area, those are ones that are both too big and too small at the same time. Containers have a lot of settings, and they can be wrong in all directions: some containers are too big for memory and too small for CPU, or the limit's too low and the request is too high. And then the purple area are the ones that are missing specifications. Maybe in your kube-system namespace you don't necessarily set limits on things, but for most production workloads you do want to have a request and a limit value, especially for memory, or at least a request for CPU. Some of these also need values, but they don't have them right now.
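To make that concrete, here is a minimal sketch of where those request and limit values live in a launch manifest. The workload name, image, and numbers are hypothetical, but these are the standard Kubernetes fields the recommendations ultimately change:

```yaml
# Illustrative only: a container spec with the request and limit values
# discussed above (names and numbers are hypothetical).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: registry.example.com/app:1.0   # hypothetical image
        resources:
          requests:
            cpu: 250m        # what the scheduler reserves on the node
            memory: 512Mi
          limits:
            memory: 1Gi      # the container is killed if it exceeds this
            # a CPU limit is optional; exceeding it only throttles the container
```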
At the bottom is a trend of how well we're doing at optimizing the environment. We started off not so good, with not much of the environment green, and we're getting more and more green. On the far right is the impact summary of the optimization in resource terms, along with an overall characterization of the environment: in this particular example, there are 189 cores of requests out there, and the optimization says we should bring that down to 125 cores. So there's significant optimization opportunity. Those are the aggregates, which are also listed across the top as far as the general opportunity goes.
Now, I'm going to drill down from here, and there are two ways you can go. We have some customers that like working in tables; if you like Excel, you'll like seeing things as tables, and we have some very powerful tables. But we also have visualizations so you can see a picture of what's going on, and you can go either way.
I'm going to start with the tables. If I go to this data tab, what you'll see here is the environment broken down, in this case by cluster. I went through all the filters at the top and I can filter in, but a new thing appeared at the top right, and that's the "group by", so I can group by different things. I can group by namespaces, I can look at things by business services, so for example I can look at the namespaces in this table. I can look at different tags, I can look at departments, whatever is populated in the environment, and see the summary. What we're looking at here in the columns of the data are the overall numbers of containers. For example, kube-system has 465 manifests in it and 585 containers running, and we're getting a view of how many things are too small or too big, have risk, have opportunity, or need values, in this case for that namespace.
When I click on a row, for example the kube-system namespace, the bottom will give me a breakdown of all the clusters that namespace exists in, which in the case of kube-system is all of them. You can see here a per-cluster breakdown of the status and the resource status. I won't go through all the columns here, but if you like looking at it in tabular form, you can get very powerful views by namespace.
One of the main ones we work on is by cluster; it's a good starting point. These are the major clusters in this environment. Now, from here, I can drill down into these numbers and get to a deeper level. I can look at the entire environment or I can look at a subset of it.
Let me go into one of these clusters, say this QA cluster on AKS. If I click here, I get to that visual I was talking about. This is the histogram, and I can look at it for the entire environment or a cluster; I can look at it for just a container type or a namespace. It's a visualization of whether, in this case, this cluster is specified properly or not from a CPU and memory perspective, and you can see these four charts: the top left is the CPU request, the top right is the memory request, and then the CPU limit and the memory limit. The bars are telling me how many containers are too big or too small.
The way you read this is to look at this big yellow bar here in CPU. This environment is just a small cluster, but you can picture this running on tens of thousands of nodes; we run it regularly on very large environments. In this case, the vast majority of the containers in the environment are way too big. They're in the 500% bucket, meaning they have 5X more CPU requested than they're actually using (for example, requesting 1000 millicores while actually using around 200). Green means it's just right, yellow means it's too big, and red means it's too small. So some of the containers are underspecified for CPU, but most are overspecified, and that's usually an indicator that there's a cost savings opportunity.
We see this yellow bar quite a bit, meaning the CPU specs are too big, and if you bring them down, you'll unstrand capacity. I'll talk more about that in a bit. The memory request, you see, is a little more nuanced, and we see this all the time. It's more of a spread: yes, I have things that are too big, but I also have things that are too small, in some cases way too small. This is very typical. Usually there's a lot of waste in CPU, while memory is a mixed bag of some things that are too small and some that are too big, and that can create high risk. In this environment, you'll see there's actually a memory shortfall of 19%, even though there's a big CPU surplus (that's the yellow 72%). So we've got a problem here: we've not given high enough request values to our containers and they're being overstacked, or we didn't give a request value at all, which also causes overstacking.
I'm going to drill more into that in the risk section and talk about the problems that causes and how it can be fixed. Then there are the memory limits. You see in this cluster we have some limits in the red; that means the limits should be bigger, and if you see a lot of red, especially in the middle here, you could have things getting killed because they're hitting their memory limit. I'll show you that as well. And of course, the gray in the memory limit chart means you didn't give it a limit. That can cause problems because if you have a memory leak, that container can grow and grow and take over a whole node, and we've seen that happen as well. This is the pictorial view of what was in that table on the previous screen.
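To illustrate the three memory situations just described (a limit that's too low, a missing request, and a missing limit), here is a hedged sketch; the container names and values are hypothetical:

```yaml
# Illustrative only: the three memory-specification problems described above.
containers:
- name: limit-too-low             # hypothetical
  resources:
    requests: { memory: 256Mi }
    limits:   { memory: 300Mi }   # too low: the container gets OOM-killed when it crosses 300Mi
- name: no-memory-request         # hypothetical
  resources:
    requests: { cpu: 100m }       # no memory request, so the scheduler reserves no memory
                                  # for it and the node can end up overstacked
- name: no-memory-limit           # hypothetical
  resources:
    requests: { memory: 512Mi }   # no limit, so a memory leak can grow until it
                                  # consumes the whole node
```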
At a glance, you can see it's like an MRI for your container environment. In this case we're seeing that, yep, we've probably got a cost problem on the CPU side, but we've got a risk problem on the memory side, and that's what we see everywhere; every customer we analyze looks a lot like this, maybe with more gray in some, more red in others. The CPU limit I didn't talk much about; we find that some customers don't bother setting CPU limits, which is fine. We'll give recommendations, but it's not the end of the world if you don't, because you'll just get throttled anyway; why pre-throttle yourself when you'll get throttled if the node gets busy? We give recommendations for all of these. The CPU limit ones maybe you don't need to take, but the rest of them are very important. So that's the histogram view.
Now, from here I can go down into even more detail. I can click on any one of these limits and get a view; for example, if I go to Stranded CPU Risk, it will take me down to a per-container view. If I drill down into this, I'll see, in this case, a view of all the containers with the highest CPU surplus. You can see there are some very spiky workloads where, in this case, we're requesting a whole CPU when it's averaging 1.5 millicores with some very transient spikes. You can set the policy for how you want to treat those spikes, but generally this one probably doesn't need a request value that high. If I scroll down, here's an interesting one, this OTel collector: it's running around 200 millicores all day long, peaking as high as 300 millicores. The pattern model is showing us what the pattern of activity is, but we're requesting a whole CPU. Again, that's probably wasteful; we're recommending you bring the request value quite far down, because it's just going to strand a lot of capacity if this thing never goes above 300 millicores.
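As a rough sketch of what acting on that kind of recommendation looks like in the manifest, using the collector pattern above; the exact recommended number comes from the analysis and your policy, and the value shown here is hypothetical:

```yaml
# Illustrative only: rightsizing an oversized CPU request on a workload that
# runs around 200 millicores and peaks near 300 millicores.
resources:
  requests:
    cpu: 1000m   # current: a whole CPU requested, most of it stranded
    # recommended: something like 350m (hypothetical), enough to cover the
    # ~300m peak with headroom while returning the rest of the node's capacity
```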
Again, what you see at the bottom of the screen here are the pattern models that the machine learning is determining. We build different models. You'll see cases where a container is replicated; this is a good example because there are two containers running, it's replicated twice, so we'll see the busiest container as well as the average across them. That's why there are so many CPU curves here: we have the existing and recommended values, the busiest container, and the average across replicas, for both CPU and memory, and we analyze all of that to figure out what you should set the request and limit to. In this case, the busiest of the two is hitting almost 3 gigabytes, while the average is 1.5 gigabytes. To come up with the right requests and limits, you need all of these models, so we build up some very detailed models of what these things are doing and how many replicas there are right now.
Now, we can go even deeper. Again, this is the machine learning model of the patterns used to drive the analytics, but if you see this little link here, we can go down into the Metrics Viewer and see the raw data being collected on this container. In this case, this is the CPU utilization of the busiest container at peak versus the Current Limit and the Recommended Limit. The Current Limit is way up above the green chart, and we're recommending you bring the limit down. Here's the memory limit, and you can see there's a bit of a ramp here.
This is the raw data. I can go back and look at any time frame that I want to, hourly or daily. So again, this gets into that observability capability, where right in the solution you can start to see this depth of data. You don't need to flip from one tool to another; you can navigate to it right from the recommendation. If I pull up the metrics again, I have a variety of metrics: I can look at Working Set, RSS, Total Memory, and different CPU metrics, for the busiest replica or the typical replica. So there's a lot of visibility into what's going on right down at the bottom level. That's the detailed raw data that goes into the analytics.
This screen shows us the machine learning algorithms and the models that we build for the analytics, and how they determine what the recommendations are. I'll cover that more in a minute as far as risk versus saving money goes. And if I keep going back up again, I can get to these tabular views and these visualizations of my environment in the Kubernetes world.
The last thing I'll show quickly here is that you can also get a view of the policies. All of this is based on very detailed policies, and in fact this is not the detailed view; if I show advanced, I get a large number of parameters that you can control. How much data do I want to look at? What are my thresholds? How do I want to treat peaks versus sustained activity? Do I want to size to peak? Do I want the request equal to the limit? All of this can be controlled very easily in our policies so you get exactly the right answer, which is important: if you want to automate this, it has to be the right answer.
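To give a feel for the kinds of knobs being described, here is a purely hypothetical policy sketch; these parameter names are illustrative only and not Densify's actual policy schema:

```yaml
# Hypothetical, illustrative policy parameters only.
sizing_policy:
  history_window_days: 30       # how much historical data to analyze
  cpu_request_basis: sustained  # size the CPU request to sustained activity
  memory_limit_basis: peak      # size the memory limit to the observed peak
  request_equals_limit: false   # whether to force request == limit
  headroom_percent: 15          # safety margin on top of the sizing target
```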
That's a quick walkthrough of the container side of the house. I'll go into some more detail in a minute, but what I want to do now is go back to the Public Cloud side for a second. I'm going to go to Amazon in this case, and you can see here are EC2, RDS, ASGs, and all the recommendations. I'm going to go to the scale groups because this is very important. In the example I'm showing, these containers are running in Amazon in scale groups; the nodes and the clusters are sitting on top of scale groups.
Here's an example: you see this node group, this is a test cluster, and this is the test node group, and here's the scale group it's running in. We automatically link the Kubernetes analysis and the nodes to the cloud analysis; this is a scale group optimization analysis. What that lets us do is inform the scale group analysis with all the requests and limits. So now when we're optimizing the scale group, it's not just looking at utilization; it actually understands the aggregate requests for CPU and memory. There's the CPU utilization on the busiest day, but here are the sums of the requests, and the delta there means we're over-requesting resources. The same goes for memory, the memory utilization versus the request values, and network I/O. All of this is factored in, so it's very important that we can link the Kubernetes analysis to the scale group analysis: when you optimize the containers, it'll optimize the node groups as well, and it just happens automatically. That one's okay, it's not recommending anything there. But if I go back up to this one, for example, you see here's a case where we're on a t3.2xlarge, scaling between 0 and 3, and we're saying, well, you're not using your CPU at all; you should go to an r6a.xlarge and scale between 0 and 4. It's a smaller instance, but you'll get by with 4 of them, and we can even predict the scaling activity on that and give you more room. So it's going to scale better, it's going to be more efficient, and if I scroll to the right, it's going to save a bunch of money. Again, the bottom line is that it links the cloud scale groups to the container analysis. It makes sure you have enough capacity to meet all the requests for CPU and memory, and as you optimize the containers, it'll automatically optimize the nodes.
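For reference, the node-group shape being changed here corresponds to something like the following eksctl-style sketch; the group name is hypothetical, and the instance types and scaling bounds are the ones mentioned above:

```yaml
# Illustrative only: an eksctl-style node group definition (name is hypothetical).
managedNodeGroups:
- name: test-node-group
  instanceType: t3.2xlarge    # current: CPU largely unused
  minSize: 0
  maxSize: 3
  # recommended shape, as described above (a smaller instance, one more node):
  #   instanceType: r6a.xlarge
  #   minSize: 0
  #   maxSize: 4
```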
And that's a general walkthrough of how Densify works.