Kubex is a game changer.

  • Understands Full Stack: Optimizes safely from containers to pods to the nodes they run on
  • Delivers Realizable Gains: Recommends only what will improve performance and reduce cost, and omits anything that won’t
  • Reliable Automation: Acts directly or connects with your preferred methods and systems

See what Kubex can do for your Kubernetes environment. Watch the 25-minute on-demand webinar.

[Video transcript]

Good day or good evening to everyone. Thanks for joining us. Our session today is all about introducing a new product by Densify called Kubex. We’re excited to share a little bit about the product, as well as give a full demo and show you what we’ve been working on over the last number of months. It’s been highly influenced by some key advisors who come from our customer base, some large infrastructure owners and managers who have driven a lot of what we’re going to show you here today.

Andrew, I’ve got two or three slides, and then we’re going to spend the majority of the time in the product.

Kubex is all about optimizing Kubernetes infrastructure. It’s really focused on allowing organizations to slash the cost and the effort required to maintain and optimize these increasingly complicated environments. There are three key areas that we like to call our differentiation, because there is a growing amount of open source as well as commercial software in the space around Kubernetes.

The first key differentiator is that what we have built is fully informed by the full stack that exists within a Kubernetes environment: the individual containers that run the important software you manage sit within complicated structures and share infrastructure right down to the node.

And so that’s extremely important, and Andrew will elaborate on it, because doing that enables us to deliver what we call realizable gains. That is, we will only recommend things that make sense in the overall context of the infrastructure: if we tell you that you should do something, it’s fully informed by what that element might be sharing, infrastructure-wise, and by the downstream effects.

So, do no harm: we’ll only tell you things that will reduce cost, improve performance, or get rid of risk. And then the last piece is increasingly important, and that is that if we tell you about a bunch of things that should be done, it should be easy to get those things done. So automation is very important.

And Andrew will talk a little bit about the options there, because they are rich. Those are some key messages. What drives us with respect to building this software is a growing set of demands in our cloud customer base, as more and more of the budget in these organizations is being consumed by Kubernetes.

And the challenge that a lot of organizations have is that what’s inside that Kubernetes infrastructure is often a blind spot with respect to cost and risk and scaling and controlling resources, because the things that are being used to manage cloud budget, often categorized in the FinOps space from a software perspective, don’t peer inside and understand what’s going on in the Kubernetes layer.

So that’s really what we’ve done here, and that is addressing this two-sided problem: yes, the cost or potential waste that might be in these environments, but also the risk or sensitivities, from a stability point of view, that you have to deal with before you can go and deal with efficiency.

So this is what Andrew’s going to talk through. I’m going to shut up now and hand it over to Andrew.

All right. Thanks Chuck, and thanks everyone for joining. I’m just going to layer on one more thing here and then get right to the demo, because this is a demo. In addition to what Chuck just mentioned about the risk versus cost kind of tension,

I want to reiterate the full stack point he made. So if we look at the challenges of resource management, or, as we traditionally call it, capacity management: you need to make sure you have enough without having too much. And there are various personas and teams involved that have various interests in this area, including platform owners. And it’s not only the different teams; there are different structures within Kubernetes that make this quite difficult to do.

A key point here is that I can’t just rush in and start downsizing things in a Kubernetes environment, because I don’t know enough about it to know if it’s safe to do. It’s not like a cloud instance where, if you’re using half a cloud instance, you make it half the size and keep going. I can’t go and make my nodes half the size in a Kubernetes environment, because it all depends on what’s being used.

And the cost is coming from the infrastructure we’re running. So the scale groups, the Karpenter nodes, the bare metal VMs, whatever it is, that’s what’s costing money. But how many of those are running is driven by what the app teams are requesting, which may or may not be correct. And that’s what I’ll focus on quite a bit in this webcast.

So I can’t just go and rush in and downsize the bottom. I can’t even really rush in and downsize the top, either, without being very careful. And all of this is kind of intertwined. Everything has a knock-on effect, and I’ll try to cover that when I go through. Again, you just can’t go in and do isolated changes in this environment without understanding the whole thing.

And that’s why we use the phrase full stack. So that’s exactly what I’ll show. I’ll start with analysis of the containers. We do a very deep analysis of all the containers to tell you if they’re configured correctly or not. We also do a very deep analysis of all the nodes, and not just the nodes themselves: we analyze the instance types against the cloud catalogs to tell you, are you running on the right instance in that cloud?

So very deep analysis, and it’s all intertwined. We won’t recommend a container change if certain conditions exist on the nodes that make it risky. We won’t recommend a container change that won’t have an impact on anything. We won’t recommend node changes if the containers are configured a certain way.

So it’s very important to get all of this correct and not just rush in and say, okay, I’m going to downsize the memory on this container, because that could actually have a very bad impact. So I want to start with that. This is just a bit of background on this. Chuck mentioned we worked on this

with advisors; we did a lot of functional and analytics enhancements, and then we kind of redid the entire user experience. And that’s what I’m going to go through today. We call it Kubex. First of all, I’ll start off with, like I just said in PowerPoint: there are containers, and there are nodes.

I’ll show you a bit of both of these. I’m in the containers right now, and I’m looking at a visualization of all the containers I own. Now, I’ll give a little caveat that I’m only running against a lab environment here with about a thousand containers. We have customers running this on hundreds of thousands of containers, and this picture will show you all those hundreds of thousands in one view.

I can’t demo that to you, but I can show you one of our lab environments. And what I’m looking at is, in this case, the entire environment. On the left here, just quickly to go through the navigation, these are my different clusters. I can click down to the clusters and see this visual, which I’ll explain in a second.

I can see the risk and waste across all my environments individually. I can go into namespaces. I can go into anything I want. I’m just looking at everything right now and getting a visual of what the situation is. I can also build custom views on this left-hand side to break things down: if I want to see my business unit, or things like prod versus dev, I can make customized views.

And I can also build filters saying, you know what, I don’t want to see everything; I don’t want to see kube-system in my results. And if I do that, I’ll see a different view: everything outside kube-system, or only show me my NGINX containers, or only show me production, whatever the case is. So I won’t spend long on this today.

But for the views and filters, if you’re interested, there’s a longer video on our website that goes into how to configure these. I can actually go and create entirely new filters; it’s like a query language to say, what do I want to isolate that’s of interest to me? But for the purpose of this demo, I’m going to show everything.

So this is everything. This is a cluster, this is a namespace, a pod, and a container, and I can kind of view that through this whole tree. I can even search it, which I’ll come back to. Let’s take the whole thing and talk about the right-hand side, which is really the important part. And let me describe how to read what

this histogram is saying. So we’re analyzing all the containers, all the utilization patterns. We’re doing machine learning on all the raw metrics coming out of Prometheus and node exporter, and we’re analyzing and saying, are these containers configured correctly from a request and limit perspective? So the top left is the CPU requests, the top right is the memory requests, then the CPU limits and memory limits.

And the way you read this is that it’s a histogram: tell me how many containers are okay, how many are way too small, how many are way too big, and how many don’t even have values. That’s the gray; the gray means you don’t have a value. Red means that you’re not asking for enough from a CPU perspective.

You’re only asking for one CPU and you’re using five CPUs or 10 CPUs; that’s what it means down here. You know, in this environment of a thousand, only 136 are correct. This is very typical, by the way, if we point this at any production environment, even in the most sophisticated environments.

We’ll see this pattern where some stuff’s okay, but in this case, the vast majority of the containers are way too big from a CPU perspective. This is where the cost savings come in. If you think about what we do, and Chuck’s point earlier, there’s FinOps and there’s SRE.

So FinOps loves this yellow bar. Well, they should hate this yellow bar, because it means they’re wasting money, but it’s an opportunity to save a lot of money, and I’ll drill down on that in a minute. I can click on any of these things and get all kinds of details, which I’ll go to in a second. So the way I would read this is that we’re stranding resources, because we’re asking for too much and we’re just not using it.
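To make the request and limit terminology concrete, here is a minimal, hypothetical container spec showing the four values the histogram scores; the name, image, and numbers are invented for illustration and are not from the demo environment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical workload, for illustration only
spec:
  containers:
    - name: app
      image: example/app:1.0   # hypothetical image
      resources:
        requests:
          cpu: "2"             # capacity the scheduler reserves on a node, used or not
          memory: 1Gi          # if actual use is far below this, the resources are stranded
        limits:
          cpu: "4"             # throttling ceiling; many teams no longer set CPU limits
          memory: 2Gi          # exceeding this gets the container OOM-killed
```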

And so that’s a very common problem we see. That is a big root of the cost and spend problem in most environments. Memory tends to be a little more complicated. So you see in memory, we do have a big yellow bar, so we have a lot of things that are way too big for memory. We also have a lot of ones that are too small.

And a lot that don’t have a memory request value at all, even outside kube-system. This little bar down here might mean I have containers where I asked for a gigabyte of memory, but I’m using five at runtime, or I’m using four at runtime. And this means I didn’t even ask for anything.

And for proper application workloads, it’s a good idea to ask for it, because Kubernetes won’t know what you’re doing there; it won’t know that you are about to use five gigabytes, and it won’t earmark that capacity for you on the nodes. And when you use it, you might blow up the node. That causes the nodes to hit 100 percent, and you get out-of-memory kills.
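As a hedged sketch of the fix being described, setting a memory request near observed usage lets the scheduler earmark that capacity; the fragment below is hypothetical, and the 5Gi figure simply echoes the five-gigabyte example above.

```yaml
# Hypothetical container fragment, for illustration only.
containers:
  - name: app
    image: example/app:1.0
    resources:
      requests:
        memory: 5Gi   # without a request, the scheduler earmarks nothing and the node can be overcommitted
```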

So when we see this pattern, there might be cost savings overall, but what happens is Kubernetes can’t place them properly. You end up with some nodes that are overstacked and other nodes that are understacked, and when they hit the roof you get kills, which is what we’re showing here. So we actually surface that and say, you know what, you actually have out-of-memory risks in this environment because your requests are all wrong. So CPU requests are wrong, wasting a lot of money; memory requests are wrong, creating a lot of risk and possibly wasting a lot of money. But you want to fix that risk first.

I don’t want to start downsizing containers in these environments if I already have nodes that are running out of memory, and I’ll come back to that in a minute. So that’s CPU requests, memory requests, and the memory limits. We don’t worry too much about CPU limits; a lot of our customers don’t set CPU limits anymore.

But under memory, you’ll see we have a whole lot that don’t have memory limits, which is a problem. That means that if you have a memory leak, you might blow up a node. But this little red section here is particularly evil. These are the ones where the limit’s too low, and when the limit’s too low, the Linux kernel will kill you, and we are actually detecting that happening.

So we look at the restarts, we look at the memory working set, we do a lot of analysis, and we look at the exit codes of the containers to say, hey, you’ve got things hitting their limits and restarting. And this is important, because a lot of restarts occur in an environment; it’s a bit noisy to look at restarts.

This is really qualifying it, to say, nope, you’ve got restarts because of memory limits. And so, if I look at these three areas, again, there’s a FinOps value and there’s an SRE value. Usually FinOps is the one that gets initially interested, because they’re trying to save money, but when we talk to SREs, this stuff becomes extremely important and usually gets fixed first, because there’s no reason why you should have it. You know, we’ve seen DNS servers getting killed because they’re hitting their limits.
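For reference, a container killed for exceeding its memory limit shows up in standard Kubernetes pod status roughly like this; this is generic Kubernetes behavior, not output from the product, and the container name and counts are invented.

```yaml
# Illustrative fragment of a pod's status for a container killed at its memory limit.
containerStatuses:
  - name: app
    restartCount: 74          # restarts accumulate as the kill/restart cycle repeats
    lastState:
      terminated:
        reason: OOMKilled     # the kernel killed the container for exceeding its memory limit
        exitCode: 137         # 128 + SIGKILL (9)
```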

There’s no reason to have that happening in your environment. So that’s the histogram. At a glance I can see my entire company status, or I can see individual clusters. I can do a comparative analysis of clusters: what does Europe look like versus the U.S., as an example? And in one click I can get down to the top ten lists and how to fix it.

So let’s do that. Before I leave here though, I’ll show you, there is actually another summary here that lets me just rank things by the amount of risk and amount of waste. At whatever level I’m at in the tree, I can see rankings of things. There’s just different ways we can navigate around.

I’m not going to spend too much time on this because we don’t have time to show everything, but I’m going to go back into this and drill down. So, again, picture, if you will, that this is your whole company. It could be hundreds of thousands of containers. I can just go here and say, look, show me the worst problems.

And then I get down to this, which we call the AI analysis details view, and it’s showing me my top 10 list, or my top N list, of worst offenders from a memory perspective. Before I drill into that: this table is all new in our product, and it behaves a lot like Excel. I can do almost anything I want in this table.

I can choose what columns I want to show. I can show all kinds of analysis details. I can filter, sort, and do different things. And I can save these views, so I can go in here right now and say, I only want to see the things that are called, you know, NGINX that are running in prod, and I’m going to sort this way, and I can save it as a view, and you can access these views here.

So I can create my own private views, I can make shared views, and we build in a whole bunch of system views here, which is what I’m showing you now. And so there’s a lot of built-in reporting here to show me all the different factors. Basically, everything in that histogram will automatically drill down to one of these reports, so anywhere I can just click, and in one click I’m into my top 10 list for that particular area: memory, CPU, limits, requests, whatever the case is.

I’m looking at memory limits, and the way you read this top row is: this is a container that has a memory limit of 96, and we’re recommending it be bumped up to 146, because it’s hitting its limit. And I know that because of this big yes over here. So we’re doing analysis on the back end of the memory working set and a bunch of different factors to say, yep, this thing is hitting its limit, and it’s restarting, in this case, 74 times in the last day.

This is probably a problem. Containers are architected to survive some restarting, but maybe not that much restarting. And so this is it, ranked by how bad it is, and at the bottom here, you can see all the various curves of the CPU and memory, all the different stats, the restart metrics. And the way I look at this is, I can pop these open and say, okay, let me see exactly what’s going on here.

So this is what we call the ML model. It’s the 24-hour pattern model of that container’s memory utilization. So this is going back through a lot of history; by default, we look at 95 days and say, let’s learn the pattern of this container to understand what it does. And in this case, it’s going as low as about 22 megabytes.

I can click here and see the actual numbers. It’s going as low as 22 and as high as 95; that’s that thin top part there. And it’s spending half its time in this blue zone, which is what we call sustained activity. So this thing’s memory is going up and down the range that we see here. And unfortunately, the limit is right there.

And so you can visually see it’s hitting its limit. In fact, I can turn these other metrics off just so I can see it more clearly. It is hitting its limit every hour. And this is a problem, because when it does that, Linux will kill the container and it restarts, and if I flip over to the restarts, you can see it’s restarting three or four times an hour.

I can very rapidly see what’s going on here. I can get a view of this container and what its problem is. Remember, I’m now one click in from my whole company; I can get to this one very quickly, or the second one in the list, and see exactly what’s happening. And then I’ll talk a bit about automation towards the end.

But of course, the goal will be to fix this by increasing the limit. Quickly, before I leave here: this is the 24-hour pattern model of this thing that we learned through the analytics, and it’s kind of what drives a lot of the recommendations. I can also go back and say, show me everything this did,

you know, in the last three months. This is the pattern of activity and the request and limit settings over the last three months, every day, or even every hour. So I can go back and zoom in on an hour of a day. Let me just drill in here: what did this thing do, you know, two weeks ago, Tuesday at nine a.m.?

So you can see there’s a lot of control here. We don’t compete with Grafana, but we obviously focus on our analytics and on giving answers that Grafana doesn’t. And right in the product, I can click around in here; I can even see the raw samples coming in from Prometheus.

So all this is available very rapidly right from here. I see the problem, I see all the stats, and I even have a home page for this container that tells me all this stuff: it’s hitting its memory limit, and it’s also on a bad node, by the way. That’s two problems. All the recommendations, all the stats.

And I can send this as a hyperlink to app teams. That’s a new capability we have where everything I’m showing here is deep-linkable. So if I find something that’s interesting, I can just send it to the owner, and they can act on it; they can interact with these curves the same way I’m doing right now.

So that’s kind of one step down through the stack for memory limits, where I’m saying: show me the risks, show me the top problems, tell me how to fix it, show me why it’s a problem, let me vet it, let me make sure I trust the analysis. And the ultimate goal is to fix it.

We do have several ways to automate. We have customers that tie it into their pipeline, hit our APIs, and update their templates every time they deploy. We have a Terraform module that will automatically make this the right value. And we now have a mutating admission controller, which is very nice.

So instead of having to change the manifest, it can just override it as it deploys, and you can selectively choose which containers you want to automate, and it will just fix it on the fly. So that’s the latest one, and it’s really useful, because it makes it very easy to fix these problems automatically.
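As a rough, generic illustration of the mutating-admission mechanism being described (this is not the product’s actual configuration, and the names and numbers are invented), the effect at deploy time looks something like this:

```yaml
# Pod spec as submitted by the app team (the manifest in source control is untouched):
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
---
# Pod spec as actually admitted, after the webhook overrides the values on the fly:
spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 250m          # right-sized value injected at admission
          memory: 1536Mi
```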

In addition to that, Andrew, the question was: in an infrastructure-as-code environment, where you’ve got prod and dev, if you have a detection in a dev environment based on load and do some right-sizing that might not be applicable to prod, how do you mitigate that challenge?

The difference between the two environments? Well, we would give different recommendations for both. That’s the manifest level we go down to, the common configuration. Each line item in here is a manifested entity. And so, for example, these ones could be replicated hundreds of times, but they’re still running off the same launch configuration.

Dev would typically show up with a different recommendation, as a different environment. Now, there are ways to say, I want to use my pre-prod recommendation for prod, and you can set up policy to sync that way. But by default, they would be given separate recommendations, because usually they behave quite differently.

So that’s how it’s scoped down here: we would scope the dev environment differently than prod from a recommendation perspective. Let me go back up. I’m just watching the clock, because I only have a few minutes left. So that’s an example of a high-risk environment. I’m going to quickly go in here and show you another kind of risk.

And these are ones where I didn’t ask for enough memory, and so Kubernetes doesn’t know I’m going to use that memory. Here’s one where I didn’t ask for any memory, but I’m using about 8 gigabytes, so we’re 7.99 gigabytes too small. I can see that in the curves down here. But if I go to the right, this is where you get to that node-level information I was alluding to.

We know what node group this is running on, and we know that that node group is actually out of memory. In this case, 100 percent of the nodes in that node group are out of memory; memory utilization is a constraint. So this is probably causing restarts.

This one’s not restarting, and it may not cause this one to restart; it might cause somebody else to restart. And so this is a very directly actionable recommendation that’s prioritized by: do these first, because you’re overstacking these nodes. And that’s where, if I go over here into the node group, I’ll go here very quickly to show you, I can see all my node groups.

I can see whether they are actually overstacked or out of balance. I don’t have much time to go through this, so I’ll just kind of show you quickly, right down to the node level. So I can say, at the node level, tell me which nodes are the busiest. These top nodes are pretty much hitting the roof here.

If I pull this open, these ones are hitting 99.5 percent utilization at certain points. When that happens, there will be out-of-memory kills. And the problem here is that I’m not requesting enough memory in my manifests. So these are getting overstacked, while some of the others are understacked; some of them are barely hitting 50 percent.

This is what happens when the requests are all wrong. The nodes get stacked funny. Kubernetes doesn’t know what you’re going to use, so it tries its best, and then you end up having runtime problems where some of the nodes are overstacked. That’s exactly what I was showing in here, in the memory. This is being caused by the memory requests being all over the place.

And Kubernetes tries, but it ends up causing overstacking. So lots of risk in this environment. It’s a lab environment, but every environment we see has these types of risks in it. So they’re usually important to clean up first, before you start tackling savings, because if you try to downsize things right here, it might make the problem even worse.

So that’s risk. Last but not least, let me show what is probably the most important one from a FinOps perspective, certainly. All right. This yellow area is the ones that are wasting a lot of resources. So these are the top offenders: containers that are over-configured from a CPU perspective.

This first one has been given 1,400 millicores. It’s about 1,380 millicores too big, and there are two of them running, so it’s wasting almost three CPUs just as one configuration. In real customer data, you’ll see hundreds of replicas of these things; you’ll see highly replicated, way-oversized containers. You know, this one has been given a CPU, but it only needs about half a CPU.

Same with this one: it’s given a CPU, and it only needs about a fifth of a CPU. It’s 800 millicores too big. And again, all of this stuff I can see down here: I can see my current request is a thousand and my recommended request is under 200, because this one just isn’t doing very much. So these are the important ones from a cost-savings perspective, because if I action these quickly, I can start shaving down how many nodes I need to run on, because I don’t need all these resources.
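As a sketch of what actioning that recommendation would look like in a manifest, using the thousand-millicore and under-200 figures quoted above (the container name is hypothetical):

```yaml
# Hypothetical container fragment reflecting the numbers quoted above.
containers:
  - name: app
    resources:
      requests:
        cpu: 200m    # was "1" (1000 millicores); right-sized to what the container actually uses
```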

And the last thing I’ll say is this. Now, that’s true, but it may not be true in every case. This is what Chuck mentioned as far as realizable gains. This one’s too big, and I can make it smaller; I can make all of these smaller. But the question becomes: is it safe to do, and will it make a difference? So that’s where, if I scroll to the right, I’ll see that node data here again.

So we know what nodes these things are running on, and we know if the nodes are healthy. So this top one is good: it’s too big, and it’s on nodes that aren’t out of CPU, so it’s safe to downsize. We wouldn’t want to downsize if this number were non-zero. And CPU requests are the primary constraint of my node group.

So this is very important: is it safe to do? Yes. Is it going to make a difference? Yes. So if I take that action, I will immediately save money. This third row down, it’s safe to downsize this one, but it’s not going to make a difference, because CPU is not my constraint; I’m out of memory, so I can downsize CPU all day long.

It won’t save me money right away. It will save money if I reconfigure the nodes, which we also do. So feel free to take that recommendation; I just shouldn’t expect it to be cheaper tomorrow. If I take enough of these recommendations, all these ones down here, eventually I can reconfigure my nodes to be memory-optimized and I’ll save a bunch of money.

Chuck used the phrase realizable gains. If I go back up here, this shows me all the oversized ones. This one showed me all the ones that I can take safely and that will save money today. Now, there’s only one in this environment because it’s a lab, but in a real environment, this is a very long list.

This is the top 10, or top N, to tackle. So I want to cover that very quickly; it’s a very important concept. Anybody can throw out recommendations to make things bigger or smaller. What we’re saying is, no: these are the ones that are safe, effective, and automatable. So you don’t want to automate off the full list, because you don’t know if it’s going to be safe.

You want to automate off the safe list. I’ll stop here because we’re out of time, but that’s just a quick run-through of all the areas of value: cost savings, risk mitigation, all of which has to be done together, all in one place, and then, of course, driving automation from that. We have an upcoming webinar on auto mode and automation and our mutating admission controller.

I think that’s coming in a few weeks, so I’ll stop talking now, and you can join that one if you’re interested in automation.

Thanks, Andrew. Thanks for the demo. I’ve got a couple of concluding slides. Probably the most important thing, though, is the fact that you can deploy this very quickly in a trial, a free trial that we offer for 60 days.

It allows you, in a read-only way (we’re not automating or changing your infrastructure), to see what those histograms would show you about your infrastructure. So it’s very low risk, with very little downside to the trial offer, and we encourage you to take us up on that.

We love to show what the potential is in a prospective infrastructure by actually seeing it in action. We have a couple of events coming up, and we do a few more webinars. Andrew mentioned us talking about auto mode and some of the native Amazon capabilities, and in fact Azure increasingly has some auto capabilities as well,

and how these capabilities relate to those automated systems that you might be taking advantage of. We will also be at KubeCon Europe in April; it’s in London, so if you’re in Europe, please come and see us. That resource on the left, 12 Risks, is kind of a neat visual summary of the issues that Andrew was talking through that show up in those histograms.

So you see the little histogram to the left, and then a description of why that happened. You can download that; it might be useful for talking to your co-workers about these types of issues. Thank you all for joining us. Thank you, Andrew, for the demo, and please take us up on either asking more questions through our contact form or saying that you’d love to trial it.

We’d love to set you up to see this value. So thank you all. Have a great day or a great evening.