Cloud Resource Control with Guardrails

Explainer video


Densify Embedded Cloud Resource Control enables FinOps teams to intelligently limit instance choices for every workload to only those that make sense.

AI and machine learning analyze each workload against the entire cloud catalog and against selected policies unique to the organization or application to establish guardrails for every workload.


[Video transcript]

Hi, I want to talk about Cloud Resource Control and describe what it is. I’m in the Densify console, looking at the Public Cloud Optimization top-level screen, and down the left you can see we have some Amazon, Azure, and Google Cloud instances and services being optimized.

I’m going to go down into EC2. This environment has quite a few EC2 instances across different business units, but I’m going to go into one specific, small one and talk about these optimization recommendations. You can see we have a bunch of recommendations here. Some are to make things smaller, some bigger, some different or newer. For example, here’s an m4.xlarge going onto an m7i-flex, one of the new flex instances; that’s one kind of recommendation. I’m going to focus on this one right here: an r4.2xlarge onto an r7iz.xlarge. So it’s going down in size, and you can see the current cost is $313 a month.

But you’re going to save $94 a month just by doing this one recommendation. And if you look at the curves at the bottom, interestingly, the current instance type is not quite big enough: you can see the workload curve, the 24-hour machine learning model, going above the threshold. The newer one, even though it’s smaller, is faster because it’s so much newer. So it’s going to perform better, and it’s going to save money.

And you can see this column shows the recommended instance type for every instance we’re analyzing. Now, if I were just to take those recommendations, that’s what we call Cloud Resource Optimization: just take the recommendation and make each instance the best one.

That’s one way to go, but let me talk about a different way of doing it. For that, I’m going to go into what we call the Catalog Map. The Catalog Map is a visualization of that instance being analyzed against the entire cloud catalog. Now, this isn’t actually the entire catalog; these are just the commonly used instance types.

I’ll show you for a second: the entire cloud catalog is pretty big and complicated. This is Amazon, and it has a lot of weird and wonderful instance types and different sizes, and it gets a bit sparse in that view. But these are the instance types that cover 99.9 percent of what our customers use. Let me describe what this view is saying.

Across the bottom are the commonly offered and commonly used instance types, and up the left are the sizes. This workload is currently sitting on an r4.2xlarge, which isn’t actually big enough to host it based on policy, and we’re recommending it go onto an r7iz.xlarge, one size down.

Now, the way you read this map is that anything red is something the workload won’t fit on: those are all too small, without enough memory, CPU, network, or disk resources to make sure it can run properly. The orange ones are the things that violate policy rules. It may be that you don’t want to run on something that old, or on something GPU-enabled if you’re not using the GPU, or on Graviton because maybe your binaries won’t work properly.

The rule engine will catch things like: do you have a local disk, and are you using it? In that case, we need to make sure you go onto another instance type that still has a local disk or SSD. It’s a very extensive rule engine. Does your AMI change? Do your drivers change? It makes sure we understand what you can and can’t do technically.

Now, the yellow ones are the things that will work but are too expensive. You can see there’s a whole bunch here: you’d be fine putting the workload on an m7i.48xlarge, for example; it’ll work, it’ll just cost almost $6,000 a month. And the green squares are the ones that are within the spend tolerance policy.
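
To make the color logic concrete, here is a minimal sketch of how a single cell of the map could be classified. It is an assumption-laden illustration: the field names, the specific policy rules, and the cost comparison are placeholders, not Densify’s actual rule engine.

```python
# Hypothetical sketch of the catalog-map color logic described above.
# Field names, policy rules, and thresholds are illustrative assumptions.

def classify(candidate, required, policy, optimal_cost, spend_tolerance):
    # Red: the workload simply won't fit (CPU, memory, network, or disk).
    if (candidate["vcpu"] < required["vcpu"]
            or candidate["memory_gb"] < required["memory_gb"]
            or candidate["network_gbps"] < required["network_gbps"]):
        return "red"
    # Orange: big enough, but violates a policy rule (too old a generation,
    # unused GPU, incompatible CPU architecture, and so on).
    if candidate["generation"] < policy["min_generation"]:
        return "orange"
    if candidate["has_gpu"] and not required["uses_gpu"]:
        return "orange"
    if candidate["arch"] not in policy["allowed_archs"]:
        return "orange"
    # Yellow vs. green: it will work; the only question is cost relative to
    # the optimal option and the configured spend tolerance.
    if candidate["monthly_cost"] > optimal_cost * spend_tolerance:
        return "yellow"
    return "green"
```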

So the green ones will all work, and depending on how much you want to spend, they might be options. Let me just turn off all the scores to focus in on this. Now, that’s the case where I have a spend tolerance; I’ve said, you know, have some freedom, do whatever you like. Let me go to the extreme and say: if money is no object and you can do whatever you like, this is what the map looks like.

All of these are fair game to put your workload on if I don’t care about money. Of course, that’s ridiculous; everybody cares about money. But you can see I can tune the spend tolerance. It’s like a giant knob you can turn to say: what if I stop everybody from going more than 10 times the optimal? And the optimal in this case is this one.

So anything that’s more than 10 times the best one is not a candidate. What about five times? What about two times? You can see that at two times I’m down to what looks like about 15 different options the workload can run on. You can be any one of those and still be within policy. If I keep going, at 50 percent there’s one option you can be on, and so on.

Eventually I get down to zero tolerance, meaning just make it the best one; there’s one answer, and that’s the recommendation. So again, you can see this is a way to control how much freedom everybody has. We find a lot of cases where you just ask the engineers or the app teams to take this recommendation, and that’s the only option offered.

Oftentimes they say, well, maybe I don’t want that one; maybe there are other options I want to consider. This lets you do that. It gives them the freedom they need to do their jobs while still making sure they’re not way off into the red, orange, or yellow. That’s what we call guardrails: this establishes guardrails on resource selection.
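
The spend-tolerance “knob” then amounts to re-running that same classification with a different multiplier and counting what stays green. Reusing the hypothetical classify() sketch above, and assuming the catalog and workload data are already loaded:

```python
# Illustrative only: count how many catalog entries stay green as the
# spend tolerance tightens from 10x down to 1x (zero tolerance).
for tolerance in (10, 5, 2, 1.5, 1):
    green = [c["name"] for c in catalog
             if classify(c, required, policy, optimal_cost, tolerance) == "green"]
    print(f"{tolerance}x tolerance: {len(green)} allowed instance types")
```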

Let me flip over to PowerPoint for a second and describe how this actually works. As I mentioned, there’s an optimization paradigm where you basically just take the recommendation, and we have different ways of doing that.

We have an ITSM integration module that lets you open change tickets, and we have a Terraform module that lets you take action automatically. That’s if you want to just make everything the best choice, and it’s a very popular way of doing it. But this new way is a bit different: we embed guardrails in the deployment stack, in the CI/CD pipeline, using policy engines.

When you deploy, we’re going to catch things that aren’t in the green. Again, this is a very different way of doing it. In the first way, the optimization, we take the data from the environment, do machine learning on it, run all our algorithms and our policy engine, and come up with a recommendation.

We can implement that in different ways. We can open a change ticket, attach all the evidence and rationale to that ticket, and enable the app teams to make the change. That’s a very popular way to do it, and it’s often project-driven or campaign-driven to make sure things are efficient.

We also have a Terraform module. When we see an approval, our API starts answering back the new answer, and if you put our line of code in your Terraform, it will automatically propagate into the wild. It’s a very elegant form of shift-left optimization, or automation. It’s great. For some companies that might be a challenge; it might be aspirational.
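
As a rough illustration of that shift-left pattern, a deploy step could resolve the currently approved instance type and hand it to Terraform as a variable. This is not the actual Densify Terraform module; the endpoint, response shape, and variable name below are hypothetical.

```python
# Hypothetical shift-left sketch: fetch the approved instance type for a
# workload and pass it to Terraform as a variable. The API endpoint, the
# response field, and the "instance_type" Terraform variable are assumptions.
import json
import subprocess
import urllib.request

def approved_instance_type(workload_id: str) -> str:
    url = f"https://optimization.example.com/api/recommendations/{workload_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["approved_instance_type"]

instance_type = approved_instance_type("billing-api-prod")
subprocess.run(
    ["terraform", "apply", "-var", f"instance_type={instance_type}"],
    check=True,
)
```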

We do have customers doing this Terraform-driven automation wholeheartedly, and it’s excellent, because it just makes sure everything is correct all the time. Now, if that’s challenging to do, or you want to take a baby step before doing that, that’s where this resource control comes in. Rather than just implementing one answer, we work with the policy engines.

I’ll talk about HashiCorp Sentinel quite a bit, but we also support AWS Config and Azure Policy, which is a very powerful policy engine, and Cloud Custodian can do this as well. With these engines in the pipeline, our API, a new API, provides instance choices to those policy engines.

Picture it this way: we come up with all the answers and make them accessible to the policy engine. In the case of Sentinel, it just hits our APIs in real time to learn what you can and can’t do for each instance. So when I go to deploy a new instance, like an r4.large in this case, it might catch it and say: wait a minute, that doesn’t have a local disk.

You currently have a local disk and you’re using it; we see I/O on it. We’re going to warn you if you try to deploy onto something that doesn’t have a local disk, because you might have performance problems. Or not enough memory: we’ve been watching you, and at quarter end you get very busy and use a lot of memory, and that instance type doesn’t have enough memory. That’s the red zone of the map. Or: that’s too expensive, and there’s a much cheaper option that will work for you. This is where it gets interesting from a FinOps perspective, because it lets us enable policy-based FinOps controls in the pipeline.
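
A sketch of those pre-deployment checks, written here as plain Python for readability even though in practice the logic lives inside the policy engine; the field names and the observed-behavior data are assumptions:

```python
# Illustrative pre-deploy checks: compare what the workload has been observed
# doing against the attributes of the instance type being requested.
def predeploy_warnings(requested: dict, observed: dict) -> list[str]:
    warnings = []
    if observed["uses_local_disk"] and not requested["has_local_disk"]:
        warnings.append("Workload shows I/O on a local disk, but the requested "
                        "instance type has no local disk or SSD.")
    if requested["memory_gb"] < observed["peak_memory_gb"]:
        warnings.append("Requested instance type does not have enough memory "
                        "for observed peak usage (e.g. quarter-end spikes).")
    return warnings
```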

So again, rather than throwing reports at teams and opening tickets, all you do is tune that spend tolerance policy. For example: hey, app teams, can we all agree that if you try to deploy something that’s more than 1.5 times the optimal, we’re going to give you a warning? When you change that spend tolerance policy, the analytics update, give you new marching orders, and make them available to the policy engines. Then, automatically, anytime anything gets deployed, it’s cross-referenced against the APIs: wait a minute, you can’t deploy that, it’s way more expensive than your other options. And when I say you can’t deploy it, there are different enforcement levels. Tools like Terraform can let you actually stop the deployment, but we don’t recommend that; usually it’s just a warning. That warning goes into the Terraform console, and this is very important, because it communicates directly back to the engineering teams and the app owners: hey, you just deployed something and it was not in compliance; it was outside the agreed spend tolerance.
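
The 1.5x spend-tolerance rule and its enforcement levels could look roughly like this. Again, this is a hedged sketch: the real checks are expressed as Sentinel, AWS Config, Azure Policy, or Cloud Custodian rules, and the names below are assumptions.

```python
# Hypothetical spend-tolerance guardrail: warn (or, if configured, block) when
# the requested instance costs more than the agreed multiple of the optimal.
SPEND_TOLERANCE = 1.5   # the multiple agreed with the app teams
ENFORCEMENT = "warn"    # hard "block" is possible but generally not recommended

def check_spend(requested_cost: float, optimal_cost: float, optimal_type: str):
    ratio = requested_cost / optimal_cost
    if ratio <= SPEND_TOLERANCE:
        return None
    message = (f"Requested instance is {ratio:.1f}x the optimal option "
               f"({optimal_type}); the spend tolerance is {SPEND_TOLERANCE}x.")
    if ENFORCEMENT == "block":
        raise RuntimeError(message)
    return message  # surfaced as a warning in the pipeline / Terraform console
```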

In fact, it gives a very specific answer: this thing you chose is actually more than two times the cost of this other option, which we consider to be the optimal one. So, very specific answers. But the beauty of it is that the answer goes right back to the engineering teams or the app owners directly, not through periodic reports or change management tickets.

It’s basically right there in the deployment tools. And we think that’s really a game changer. We think it creates a new paradigm because, on one hand, it empowers the engineers. You can jointly establish controls with the engineering teams and say: hey, can we all just agree that we’re not going to be really wasteful or do something really stupid?

But within that, it still gives them the freedom to do whatever they want. So: do whatever you want, and we’ll just warn you automatically if you go outside the guardrails. I like to think it’s a lot like a virus scanner. Instead of sending someone reports saying, hey, can you please take these files off your laptop, you just install a virus scanner and let it handle it automatically. Everybody agrees it should be done, so let’s just automate it and make that check happen automatically.

So engineers’ lives get better. They have a lot more freedom, and they’re not being henpecked with all these reports; their tooling just gives them warnings automatically. And on the FinOps side, it really empowers FinOps, because it’s not just a basic quota or spend limit. It’s a very intelligent analysis: a deep technical analysis of all the workload patterns and all the policies that says, this is the best one for you across all those dimensions. Everything is compared to that, so it’s not an arbitrary limit or quota.

It’s actually a very intelligent financial analysis, and the warnings come from the stack, so you don’t have to send reports to people or chase them down and try to get them to make changes. And it works regardless of how the workload is deployed: the engine is going to catch it as it goes out, no matter how you started it, and it’s going to give the warnings to the end-user teams.

So again, we think this is a real paradigm shift in cloud optimization, enabling the next generation, which we call Cloud Resource Control.