How to Manage AWS Cost Outliers

November 11, 2020

How to Identify AWS Savings by Analyzing Cost Trends & Detecting Anomalies

A few years ago, we realized that spending in our AWS product test environment had jumped significantly from one month to the next. We drilled down into the issue and traced it to some RDS database instances that had been spun up to test new product features. No one realized that these expensive instances had been left running after the tests were complete and had been racking up charges for several months. Fortunately, we have a relatively small test environment, and although we came close to doubling its cost over this time, we did not break the bank. However, perception is everything, and management wanted the expenses brought under control. If only we had some way of detecting anomalies and dealing with problems like this before the expense had a chance to accumulate!

In a traditional on-premises data center, once the data center is built, costs are consistent and available resources are constrained to what has been installed. Ongoing decisions focus on resource management and optimization (getting the most out of what has physically been installed) and are typically tracked by resource utilization.

In a cloud environment there is essentially no physical constraint: the environment can grow as needed, and expansion to meet demand is literally a credit card away. While resource utilization metrics continue to be important, the ongoing cost of resources is now a significant measure of how we are managing the environment.

Being able to visualize this cost is valuable for a FinOps cloud financial management model, where multiple business teams work together to identify the best balance of cost and overall quality of service delivered. More importantly, cost visibility allows us to identify trends and anomalies in spending that point to savings opportunities with no significant impact on the quality of service delivered.

Graph of EBS costs over time
Chart showing increasing AWS EBS snapshot costs over a two-year period

A trend is simply an ongoing spending pattern. Spending on EBS snapshots shown above has been steadily increasing over a two-year period.

Subsequent investigation revealed that this spending grew from negligible to more than 10% of total dollars spent for this AWS customer over the time period. If this trend can be traced back to poor management of EBS snapshots (in other words, the snapshots are not being deleted after an appropriate period of time), then a significant savings opportunity exists.

An anomaly (or outlier) is a sudden change in spending that does not fit an existing pattern. The causes of anomalies are diverse. A test environment spun up over the weekend to validate changes for the production environment will result in a spike in cost, as will the expiry of a Reserved Instance or Savings Plans subscription. Given the size of many enterprises’ expanding AWS footprints, these changes represent significant increases or decreases in spending.

Graph showing a possible AWS cost anomaly
Graph showing a potential AWS spending anomaly that appears to precede continuous increased cloud costs

After the $2,000 jump shown between the 17th and 24th of August, spending does not return to its original $2,000 average, which merits additional investigation to see if this ongoing, increased cost is justified. The goal of anomaly detection is to capture the change and remind the appropriate person to turn off the test environment at the end of the weekend, before the change results in significant accumulated costs.

How to Calculate & Isolate AWS Cost Outliers

Outliers can be detected using a model similar to the Bollinger Bands used in financial analysis. By calculating a moving average and the standard deviation about that moving average, we can determine a volatility envelope, or outlier threshold.

AWS cost outlier
Graph showing an AWS cost outlier that has surpassed the specified outlier threshold

The moving average plots expected behavior. The standard deviation calculated from this moving average gives the volatility envelope (the outlier threshold) for acceptable variation of the actual billed amounts from this average. Observed cost or usage values that fall outside of this envelope are identified as outliers that deserve further investigation (highlighted and shown in red above).

Three settings are used to tweak this model, increasing or decreasing the variation and magnitude of the outliers detected.

AWS outlier detection threshold, time window, and minimum cost
AWS spending outlier detection settings

These input parameters are:

The Outlier Threshold
A multiplication factor for the volatility envelope. This number is multiplied by the calculated standard deviation to increase or decrease the size of the envelope. The default value is 2.
The Outlier Time Window
The number of historical samples used to calculate the moving average. The default value is 10.
The Outlier Minimum
The minimum cost in dollars or usage value of an outlier. Outliers less than this minimum are ignored.
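
The model and its three settings can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function name `find_outliers` is hypothetical, the window is assumed to cover the samples preceding the value being tested, the population standard deviation is assumed, and the default minimum of 0 is an assumption because no default is stated above.

```python
from statistics import mean, pstdev


def find_outliers(costs, threshold=2.0, window=10, minimum=0.0):
    """Return (index, cost, moving_average) for each detected outlier.

    threshold: multiplier applied to the standard deviation (default 2, as above)
    window:    number of historical samples in the moving average (default 10, as above)
    minimum:   outliers with a cost below this value are ignored
               (the 0.0 default is an assumption)
    """
    outliers = []
    for i in range(window, len(costs)):
        # Assumption: the window holds the samples before the current one.
        history = costs[i - window:i]
        average = mean(history)
        # Assumption: population standard deviation of the window.
        envelope = threshold * pstdev(history)
        cost = costs[i]
        if abs(cost - average) > envelope and cost >= minimum:
            outliers.append((i, cost, average))
    return outliers


# Ten flat days, then a one-day spike on day index 11.
daily_costs = [100] * 10 + [100, 500, 100]
for index, cost, average in find_outliers(daily_costs):
    print(f"day {index}: cost {cost} vs. moving average {average}")
```

Raising `threshold` widens the envelope and reports fewer outliers, a longer `window` smooths the moving average, and a higher `minimum` suppresses spikes too small to be worth investigating.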

A detailed tabular listing of the AWS billing line items that make up the outlier helps to quickly identify the expenses that contribute to a spending spike. This allows us to compare the increased spending against more typical spending and identify the cause by different attributes and user tags (linked account, AWS service, business unit, etc.), so that the outlier can be attributed to someone who can investigate it further and take any necessary action. An automated ‘diff’ of this detailed listing against a day or multiple days before the outlier allows us to identify the specific line items that are new or have changed, giving us better insight into the actual cause.
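
The ‘diff’ idea can be illustrated with a short Python sketch. Everything here is hypothetical: line items are assumed to be available as dictionaries keyed by (linked account, service) with a daily cost, and `diff_line_items` is an illustrative name. Reporting line items that disappeared is an addition beyond the new and changed items described above.

```python
def diff_line_items(baseline, outlier_day, tolerance=0.01):
    """Compare two {line_item_key: cost} mappings.

    Returns (key, status, cost_before, cost_after) tuples, with the
    largest absolute change first.
    """
    changes = []
    for key, cost in outlier_day.items():
        before = baseline.get(key)
        if before is None:
            changes.append((key, "new", 0.0, cost))
        elif abs(cost - before) > tolerance:
            changes.append((key, "changed", before, cost))
    # Addition to the article's description: also report removed items.
    for key, before in baseline.items():
        if key not in outlier_day:
            changes.append((key, "removed", before, 0.0))
    changes.sort(key=lambda change: abs(change[3] - change[2]), reverse=True)
    return changes


# Hypothetical data: a typical day versus the day of the outlier.
typical = {("dev", "EC2"): 40.0, ("dev", "S3"): 5.0}
spike = {("dev", "EC2"): 40.0, ("dev", "S3"): 9.0, ("dev", "RDS"): 120.0}
for key, status, before, after in diff_line_items(typical, spike):
    print(key, status, before, "->", after)
```

Sorting by the size of the change puts the most likely cause of the spike at the top of the list.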

Simplifying the Detection of AWS Billing Outliers

Now that we have a means for outlier detection, how do we use it?

Attempting to manually review the bill and identify outliers would be time-consuming, and capturing all the outliers in a large environment this way would be virtually impossible.

Wrapping outlier detection in a report that captures the outliers for you in a given environment makes sense. This gives us a sortable list of outliers to investigate that we can then prioritize based on further analysis and filtering.

Amazon Web Services cost outliers report
Report showing AWS outliers

Just as for a specific outlier, it makes sense to be able to adjust the output for the generated list of outliers. Useful input parameters include:

  1. The 3 model settings already described above
  2. Multiple groupings
    1. By breaking the outliers down using multiple, user-selected groupings we are able to identify outliers at a more granular level and provide more meaningful detailed drilldowns of associated billing line items.
    2. Groupings include both the properties that are standard to any bill (account, region …) and customizable, user defined tags that are relevant to the business unit evaluating the outliers (owner, project, business unit, etc.).
  3. Filtering reduces the scope of the outliers identified. If I am looking for daily outliers related to usage, then at a minimum I want to filter out the recurring monthly expenses and credits. The filters available match the groups that can be used when breaking the outliers down.
  4. Date range selection and time aggregation are also used to reduce the scope. Typically, if I am looking for daily outliers, I only want to consider the last full billing date. Conversely, if I am looking for patterns at a monthly level, I might aggregate the data by month and consider multiple months of history.
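
The grouping and filtering steps might look like the following Python sketch, which turns raw line items into one cost series per group, ready for outlier detection. The record fields (`date`, `account`, `service`, `charge_type`, `cost`) and the function name `build_series` are assumptions made for illustration; a real bill has its own schema.

```python
from collections import defaultdict


def build_series(line_items, group_by, exclude=None):
    """Sum cost per date for every combination of the group_by fields.

    exclude maps a field name to the set of values to filter out.
    Returns {group_tuple: [(date, total_cost), ...]} sorted by date.
    Dates with no line items for a group are not zero-filled.
    """
    exclude = exclude or {}
    totals = defaultdict(lambda: defaultdict(float))
    for item in line_items:
        if any(item.get(field) in values for field, values in exclude.items()):
            continue
        group = tuple(item.get(field) for field in group_by)
        totals[group][item["date"]] += item["cost"]
    return {group: sorted(by_date.items()) for group, by_date in totals.items()}


# Hypothetical line items; ISO date strings sort chronologically.
items = [
    {"date": "2020-08-17", "account": "dev", "service": "EC2", "charge_type": "Usage", "cost": 10.0},
    {"date": "2020-08-17", "account": "dev", "service": "EC2", "charge_type": "Usage", "cost": 5.0},
    {"date": "2020-08-18", "account": "dev", "service": "EC2", "charge_type": "Usage", "cost": 12.0},
    {"date": "2020-08-18", "account": "dev", "service": "EC2", "charge_type": "Credit", "cost": -3.0},
]
series = build_series(items, group_by=("account", "service"),
                      exclude={"charge_type": {"Credit", "Recurring"}})
print(series)
```

Monthly aggregation follows the same pattern with the date truncated to its year and month before summing.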

Automating Delivery of AWS Outlier Notifications & Reports

A generated list of outliers is a step in the right direction, but what we really want is automatic notification of new outliers, delivered in a way that makes business sense.

Working with our larger enterprise customers, we have automated the delivery of outliers using several delivery mechanisms, including:

  • Email
  • Dropping a CSV of the outlier list to an S3 bucket
  • API access
  • Slack
AWS billing outlier notification in Slack
Business stakeholders can be automatically notified of AWS cost outliers through collaboration tools like Slack

Delivery of outliers via API or S3 allows customers to further filter and qualify delivered outliers based on their business logic. They can then deliver the outliers via their own mechanism, such as Slack, and provide links and images by using our API to pull the detailed chart and line-item details for each outlier.
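
As a rough illustration of that customer-side filtering, the Python sketch below reads an outlier list in CSV form and keeps only the rows one team cares about. The column names (`owner`, `service`, `cost`) and the function name `select_outliers` are hypothetical; the actual layout of the delivered CSV is not described here, and fetching the file from S3 is left out.

```python
import csv
import io


def select_outliers(csv_text, owners, min_cost):
    """Keep rows whose owner is in `owners` and whose cost is at least `min_cost`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row["owner"] in owners and float(row["cost"]) >= min_cost
    ]


# Hypothetical CSV content with assumed column names.
delivered = """owner,service,cost
data-team,RDS,120.50
web-team,EC2,35.00
data-team,S3,4.25
"""
for row in select_outliers(delivered, owners={"data-team"}, min_cost=50):
    print(f"{row['owner']}: {row['service']} outlier of ${row['cost']}")
```

The selected rows can then be handed to whatever notification mechanism the team already uses.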

Why FinOps Practitioners Must Stay Vigilant

Ideally, we want to prevent wasted spending before it occurs. Amazon provides automatic mechanisms that help us avoid unexpected costs by cleaning up after ourselves, such as automatic EBS volume deletion.

However, if something we have done is not automatically covered (or we did not set up a cleanup mechanism in the first place), we want to detect variations in our spend so we can act quickly to avoid an unnecessary ongoing expense.

Anomaly detection allows us to capture and track the added expense of cloud resources that were intended to be transient, notifying us to terminate these resources if they were mistakenly left active.