A few years ago, we realized that spending in our AWS product test environment had jumped significantly from one month to the next. We drilled down into the issue and traced it to some RDS database instances that had been spun up to test new product features. No one realized that these expensive instances had been left running after the tests were complete, and they racked up charges for several months. Fortunately, we have a relatively small test environment, and although we came close to doubling its cost over this time, we did not break the bank. However, perception is everything, and management wanted the expenses brought under control. If only we had some way of detecting anomalies and dealing with problems like this before the expense had a chance to accumulate!
In a traditional on-premises data center, once the data center is built, costs are consistent and the available resources are constrained to what has been installed. Ongoing decisions focus on resource management and optimization (getting the most out of what has physically been installed) and are typically tracked by resource utilization.
In a cloud environment there is essentially no physical constraint: the environment can grow as needed, and expansion to meet demand is literally a credit card away. While resource utilization metrics continue to be important, the ongoing cost of resources is now a significant measure of how well we are managing the environment.
Being able to visualize this cost is valuable for a FinOps cloud financial management model, where multiple business teams work together to identify the best balance of cost and overall quality of service delivered. More importantly, cost visibility allows us to identify trends and anomalies in spending that point to savings opportunities with no significant impact on the quality of service delivered.
A trend is simply an ongoing spending pattern. The spending on EBS snapshots shown above, for example, has been steadily increasing over a two-year period.
Subsequent investigation revealed that this spending grew from a negligible amount to more than 10% of total dollars spent for this AWS customer over the period. If this trend can be traced back to poor management of EBS snapshots (in other words, the snapshots are not being deleted after an appropriate period of time), then a significant savings opportunity exists.

An anomaly (or outlier) is a sudden change in spending that does not fit an existing pattern. The causes of anomalies are diverse. A test environment spun up over the weekend to validate changes for the production environment will result in a spike in cost, as will the expiry of a Reserved Instance or Savings Plan subscription. Given the size of many enterprises' expanding AWS footprints, these changes can represent significant increases or decreases in spending.
The $2,000 jump shown between August 17th and 24th never returns to the original $2,000 average, meriting additional investigation to see whether this ongoing, increased cost is justified. In the case of the weekend test environment, the goal of anomaly detection is to capture the change and remind the appropriate person to turn off the environment at the end of the weekend, before the change results in significant accumulated costs.
Outliers can be detected using a model similar to the Bollinger Bands used in financial analysis. By calculating a moving average and the standard deviation about that moving average, we can determine a volatility envelope, or outlier threshold.
The moving average plots expected behavior. The standard deviation calculated from this moving average gives the volatility envelope (the outlier threshold) for acceptable variation of the actual billed amounts from this average. Observed cost or usage values that fall outside of this envelope are identified as outliers that deserve further investigation (highlighted and shown in red above).
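As a minimal sketch of this approach (assuming daily costs are available as a pandas Series indexed by date; the window length and standard-deviation multiplier shown are illustrative defaults, not the exact settings used in any particular product):

```python
import pandas as pd

def detect_outliers(daily_cost: pd.Series, window: int = 30, num_std: float = 2.0) -> pd.DataFrame:
    """Flag daily costs that fall outside a Bollinger-Band-style envelope.

    daily_cost: daily cost indexed by date
    window:     number of days in the moving average (illustrative default)
    num_std:    width of the envelope in standard deviations (illustrative default)
    """
    # Expected behavior: rolling mean of recent spend
    moving_avg = daily_cost.rolling(window).mean()
    # Volatility about that mean defines the envelope
    moving_std = daily_cost.rolling(window).std()
    upper = moving_avg + num_std * moving_std
    lower = moving_avg - num_std * moving_std

    result = pd.DataFrame({
        "cost": daily_cost,
        "expected": moving_avg,
        "upper": upper,
        "lower": lower,
    })
    # Anything outside the envelope is an outlier worth investigating
    result["outlier"] = (daily_cost > upper) | (daily_cost < lower)
    return result
```

Rows flagged as outliers correspond to the points highlighted in red in the chart above.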
Three settings are used to tweak this model, increasing or decreasing the variation and magnitude of the outliers detected.
These input parameters are:
A detailed tabular listing of the AWS billing line items that make up the outlier helps to quickly identify the expenses that contribute to a spending spike. This allows us to compare the increased spending against more typical spending and identify the cause by different attributes and user tags (linked account, AWS service, business unit, etc.), so that the outlier can be attributed to someone who can further investigate it and take any necessary action. An automated 'diff' of this detailed listing against one or more days before the outlier allows us to identify the specific line items that are new or have changed, giving us better insight into the actual cause.
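A hedged sketch of such a diff might look like the following (assuming the billing line items are loaded into a pandas DataFrame; the column names used here are illustrative, not the actual billing schema):

```python
import pandas as pd

def diff_line_items(items: pd.DataFrame, baseline_date: str, outlier_date: str) -> pd.DataFrame:
    """Compare billed line items on an outlier day against a baseline day.

    items: DataFrame with 'usage_date', 'line_item' (e.g. account/service/usage type),
           and 'cost' columns -- these column names are assumptions for illustration.
    """
    base = (items[items["usage_date"] == baseline_date]
            .groupby("line_item")["cost"].sum())
    spike = (items[items["usage_date"] == outlier_date]
             .groupby("line_item")["cost"].sum())

    diff = pd.DataFrame({"baseline": base, "outlier_day": spike}).fillna(0.0)
    # Line items that are new or have grown the most explain most of the spike
    diff["change"] = diff["outlier_day"] - diff["baseline"]
    return diff.sort_values("change", ascending=False)
```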
Now that we have a means for outlier detection, how do we use it?
Manually reviewing the bill to identify outliers would be time-consuming, and in a large environment it would be virtually impossible to capture them all.
Wrapping outlier detection in a report that captures the outliers for a given environment makes sense. This gives us a sortable list of outliers to investigate, which we can then prioritize based on further analysis and filtering.
Just as for a specific outlier, it makes sense to be able to massage the output of the generated list of outliers. Useful input parameters include:
A generated list of outliers is a step in the right direction, but what we really want is automatic notification of new outliers, delivered in a way that makes business sense.
Working with our larger enterprise customers, we have automated the delivery of outliers using several delivery mechanisms, including:
Delivery of outliers via API or S3 allows customers to further filter and qualify delivered outliers based on their own business logic. They can then deliver the outliers via their own mechanism, such as Slack, and provide links and images by using our API to pull the detailed chart and line-item details for each outlier.
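A consumer of an S3 delivery might look something like the following sketch (the bucket, key, JSON record layout, and dollar threshold are all assumptions for illustration, and the Slack incoming-webhook URL is a placeholder supplied by the customer):

```python
import json
import boto3
import requests

def notify_outliers(bucket: str, key: str, webhook_url: str) -> None:
    """Read a delivered outlier report from S3 and post a summary to Slack."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    outliers = json.loads(body)  # assumed layout: list of {"date", "service", "cost", "expected"}

    for o in outliers:
        # Apply whatever business logic makes sense, e.g. ignore small deltas
        if o["cost"] - o["expected"] < 500:
            continue
        text = (f"Cost outlier on {o['date']}: {o['service']} billed "
                f"${o['cost']:,.2f} vs expected ${o['expected']:,.2f}")
        requests.post(webhook_url, json={"text": text})
```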
Ideally, we want to prevent wasted spending before it occurs. AWS provides mechanisms that help us avoid unexpected costs by automatically cleaning up after ourselves, such as automatic EBS volume deletion.
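As one hedged example of checking whether that cleanup is actually in place (a sketch using boto3; any attached volume without the DeleteOnTermination flag will outlive its instance and keep billing):

```python
import boto3

def volumes_that_will_linger(region: str = "us-east-1") -> list[str]:
    """List attached EBS volumes that will NOT be deleted when their instance terminates."""
    ec2 = boto3.client("ec2", region_name=region)
    lingering = []
    for page in ec2.get_paginator("describe_volumes").paginate():
        for vol in page["Volumes"]:
            for att in vol.get("Attachments", []):
                if not att.get("DeleteOnTermination", False):
                    lingering.append(vol["VolumeId"])
    return lingering
```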
However, if something we have done is not automatically covered (or we did not set up a cleanup mechanism in the first place), we want to detect variations in our spend so we can act quickly and avoid an unnecessary ongoing expense.
Anomaly detection allows us to capture and track the added expense of cloud resources that were intended to be transient, notifying us to terminate these resources if they were mistakenly left active.