Agile FinOps

It’s been my experience that Agile is the most effective methodology to pair with FinOps. With Agile, the focus is on prioritizing the most impactful savings with the least amount of effort. After demonstrating quick wins, you will have a much greater chance of securing the organizational mandate, funding, and resources that are critical to fueling an organizational FinOps program. I discourage an initial waterfall approach with a long run-up to action. Quite literally, dollars, euros, and other currencies are going out the door by the second.

Where can I start today?

I’ve prioritized savings opportunities into the following 5 categories from what I consider the easiest to hardest to capture.

Unused resources
Commitments
Anomalies
Under-used resources
Architecture

It might take a lot of time to formalize a strategy for all the elements mentioned above. But don’t get lost in defining everything before you get started. Put together an agile plan that maximizes what you can do and know now, with the least amount of effort and the largest impact.

Unused Resources

We have all been victims of forgetting. You start working on something, then for whatever reason your attention switches to something new and you forget all about what you were working on. Generally, if it’s a piece of paper on your desk you’ll see it and be reminded of it when you come back to your desk. However, when there is a lack of visibility of its existence, it is easy to forget. This is especially true when your cloud bill becomes so massive it is impractical to search for questionable charges like you could with a credit card bill. The complexity is further compounded when resources are created by others for unspecified purposes and there is a lack of resource classification to help identify its owner and use.

Ultimately, the solution is to effectively categorize resources and have an efficient method to validate their use. However, the fact that you are reading this article is probably an indication that this level of maturity doesn’t yet exist at your organization. Don’t worry, you are in good company.

Some of the more common types of unused resources are:

Virtual machines.
Virtual machines (VMs) are perhaps the most fundamental element of the cloud and the most pervasive. When it comes to potential savings associated with unused VMs, there are generally two types: (a) those that have been left running but that no one is using and (b) those that have been shut down but are still incurring storage costs. The costs of running VMs are well understood; the costs of shut-down VMs are less so. Even when a VM is shut down, there are still charges for its associated storage (attached disks). In my general experience, the total cost of running a VM is about 60% compute/license and 40% storage. Thus, turning off the VM saves only about 60% of what you were paying, and the remaining 40% continues. Of course, your mileage may vary. Presuming you have a large number of VMs, you will want to prioritize a list of VMs to focus your attention on. The safe bet is to focus on VMs that have been off/deallocated as well as VMs that haven’t had an “interactive” login (excluding service accounts) for over 6 months.
Storage.
In many organizations, storage is among the most costly cloud resources. Even though storage is “cheap”, charges can quickly add up, especially when premium tier storage is used in substantial amounts. Most organizations have established data retention policies that dictate how long data should be stored. However, many organizations keep data longer than they are supposed to, which is an issue in itself. They can also apply policies in an over-conservative manner to data that was never intended to be governed by that retention policy. It’s critical to have a clear retention policy for (a) what data needs to be retained, (b) the format in which it should be kept, (c) how long it should be stored, and (d) whether there are any additional archival requirements. Retention policies generally focus on client and audit data. However, from a FinOps perspective, you should also establish policies for other types of data such as backups (server and database), disconnected VM disks [as a result of either deleting the VM but not its disk, or manually disconnecting a disk for any of a multitude of reasons], and other generic storage elements (blob, S3, etc.) that haven’t logged a transaction in a specified amount of time. You should also take into account how these policies may vary across the Dev, QA, UAT, and Production environments.
Cloud Services [and Databases]
Many people presume cloud native revolves around per-transaction billing. However, cloud-managed databases are a good example of services that have a minimum cost regardless of whether there are any transactions. Even if these resources are unused for months or years, your organization is still paying for them. Therefore, it’s important to understand which types of cloud services with minimum costs your organization uses, what it pays for each type (as a means to prioritize), and how to determine the last time there was substantive use.
Network connections.
Although this isn’t a typical cloud charge, network circuits and VPN connections can be used to connect your data centers, offices, clients, vendors, and the cloud together. Just like resources in the cloud, network circuits can be forgotten and never torn down. Not only can this contribute to unnecessary expenses, but it can also contribute to routing problems and security concerns. This may seem a bit out of place in this discussion, but it is important to call out because I have seen organizations with millions in annual billings for network circuits that were not used. Even in cases where a long-term contract is in place, there is still the opportunity to negotiate with the vendor for other services your organization could actually use.
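
The VM prioritization rule described above can be sketched in a few lines. This is a minimal illustration, not a provider integration: the `Vm` record, its field names, and the sample inventory are hypothetical, and the 40% storage share is only the rule of thumb from the text, so substitute your own measured ratio.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STORAGE_SHARE = 0.40                   # rule-of-thumb share of VM cost that is storage (assumption)
IDLE_THRESHOLD = timedelta(days=183)   # roughly 6 months without an interactive login


@dataclass
class Vm:
    name: str
    power_state: str                         # "running" or "deallocated" (hypothetical values)
    full_monthly_cost: float                 # compute + license + storage while running
    last_interactive_login: Optional[date]   # None if no interactive login was ever recorded


def residual_cost(vm: Vm) -> float:
    """Monthly cost still being paid for the VM in its current state."""
    if vm.power_state == "deallocated":
        return vm.full_monthly_cost * STORAGE_SHARE
    return vm.full_monthly_cost


def is_candidate(vm: Vm, today: date) -> bool:
    """A VM is a review candidate if it is off or has had no recent interactive login."""
    if vm.power_state == "deallocated" or vm.last_interactive_login is None:
        return True
    return today - vm.last_interactive_login > IDLE_THRESHOLD


def prioritize(vms, today):
    """Return review candidates, largest ongoing cost first."""
    candidates = [vm for vm in vms if is_candidate(vm, today)]
    return sorted(candidates, key=residual_cost, reverse=True)


# Hypothetical inventory, as it might look after exporting from your own tooling.
inventory = [
    Vm("build-01", "running", 500.0, date(2023, 1, 10)),
    Vm("legacy-db", "deallocated", 1000.0, None),
    Vm("web-01", "running", 300.0, date(2024, 5, 1)),
]
ranked = prioritize(inventory, today=date(2024, 6, 1))
for vm in ranked:
    print(vm.name, round(residual_cost(vm), 2))
```

Sorting by the cost still being paid, rather than by the VM's original size, keeps attention on where money is actually leaving today.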

Commitments

Vendors generally provide more favorable pricing when organizations commit to a certain minimum of usage. In the case of your cloud provider, favorable pricing can come in the form of reserved instances (RIs) and an enterprise agreement.

RIs typically take the form of 1+ year agreements with decreasing costs the longer the term. A commitment, by definition, obligates you to abide by the initially agreed-upon terms. When considering RIs, I recommend:

You attempt to take a first pass at cleaning up unused and under-used resources before you purchase RIs. It would be a shame to purchase RIs for resources you no longer need.
You should be cognizant of how the RI parameters can be optimized for your environment. For example, some cloud providers provide mechanisms to apply RIs to a particular subscription or resource group. This may be helpful to ensure the RI(s) is/are always bound to a particular business unit’s resources for chargeback purposes. But if that business unit no longer has a need for the RI, some cloud providers do NOT allow you to redirect the RI back to a general pool to repurpose across the organization. You may wind up having to continue paying for something that you aren’t using.
You should become well-versed in your cloud vendor’s RI cancelation/refund policy and optimize your RI process accordingly. Several cloud vendors allow you to cancel your RI purchases with certain restrictions and penalties. As an example, review Azure’s RI cancelation/refund policy and AWS’s policy on how to modify, exchange, or sell your RIs.
You should consider spreading out your RI purchases over several months. You might be surprised and/or initially misunderstand how your RIs are being used. Slowly wading into the RI water is generally a more prudent methodology.
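
To make the risk of paying for an unused commitment concrete, here is a minimal break-even sketch. It assumes a simplified pricing model that is not taken from any provider's price sheet: the reservation bills a discounted hourly rate for every hour of the term, used or not, while on-demand bills only for hours actually used. The function names and the 25% discount are hypothetical.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the term a resource must run for the reservation to pay off.

    Under the simplified model, reserved cost = rate * (1 - discount) * term hours
    and on-demand cost = rate * used hours, so the two are equal when
    used hours / term hours = 1 - discount.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def ri_savings(on_demand_hourly: float, discount: float,
               hours_in_term: float, hours_used: float) -> float:
    """Positive result: the reservation saved money. Negative: it cost more than on-demand."""
    on_demand_cost = on_demand_hourly * hours_used
    reserved_cost = on_demand_hourly * (1.0 - discount) * hours_in_term
    return on_demand_cost - reserved_cost


# Hypothetical 25% discount: the resource must run more than 75% of the term.
print(break_even_utilization(0.25))
print(ri_savings(1.0, 0.25, hours_in_term=1000, hours_used=500))
```

A resource that is likely to be cleaned up or downsized partway through the term can easily fall below the break-even point, which is why the cleanup pass comes first.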

In the case of enterprise agreements, there are advantages to pursuing these agreements with both your cloud provider and marketplace vendors. Enterprise agreements can be customized to provide cheaper pricing and better service if you commit to a certain level of consumption. Note that organizations sometimes create cloud resources with pay-as-you-go marketplace licenses, later sign an enterprise agreement, and never go back to replace the marketplace-licensed resources with the more favorably priced enterprise agreement configuration. They are then essentially paying double: the enterprise agreement plus the marketplace license charges. You should therefore review existing marketplace charges to see whether your organization already has an enterprise agreement or could benefit from one. You should also work to ensure your organization has an enterprise agreement with your cloud provider.

Anomalies

Many FinOps SaaS vendors and some cloud providers offer anomaly detection. On a recurring basis, they review the history of charges for all your cloud resources. If they notice a rapid change, they will alert you. This can highlight misconfigured resources and fiscally irresponsible solution architectures.

These tools usually have customizable parameters to suit your needs, but they generally do not alert on slow and sustained growth. I’ve seen a number of situations where slow growth indicates a problem — such as data deletion policies that were no longer functioning. As such, I suggest using budgeting alerts and/or monitoring resources for slow-growth situations to highlight non-organic increased consumption.
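
A slow-growth check can be very simple. The sketch below flags a resource whose monthly cost has risen every month across a window and has grown by at least a set percentage overall. The function name, the 6-month window, and the 20% threshold are illustrative assumptions to tune for your environment; this is not how any particular vendor's anomaly detection works.

```python
def sustained_growth(monthly_costs, months=6, min_total_growth=0.20):
    """Return True if cost rose in each of the last `months` months
    and grew by at least `min_total_growth` over that window."""
    if len(monthly_costs) < months + 1:
        return False  # not enough history to judge
    window = monthly_costs[-(months + 1):]
    if window[0] <= 0:
        return False  # avoid dividing by zero for brand-new resources
    rising_every_month = all(
        later > earlier for earlier, later in zip(window, window[1:])
    )
    total_growth = (window[-1] - window[0]) / window[0]
    return rising_every_month and total_growth >= min_total_growth


# Hypothetical storage account whose deletion policy stopped working.
storage_costs = [100, 104, 108, 113, 118, 123, 129]
print(sustained_growth(storage_costs))
```

No single month here would trip a rapid-change alert, yet the window as a whole shows the kind of steady, non-organic growth worth investigating.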

In addition to looking for anomalies on a per-resource basis, I find it helpful to review the total service-level charges categorized by environment (dev, QA, UAT, prod, demo, etc.) over time. This will highlight fiscally insensitive software development architectural patterns that are making their way to production, as well as misconfigurations. By sorting service types by their aggregate values, you can quickly see total charges that seem out of place. One example I’ve encountered is exorbitant security scanning charges in a non-public-facing environment. In this case, the issue was that a high-traffic cloud storage container was effectively a private share but was misconfigured to look like a public-facing share, and it was incurring the financial wrath of security scanning charges.
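
The environment-by-service rollup described above might look like the following sketch. It assumes billing line items have already been reduced to hypothetical `(environment, service, cost)` tuples, for example from a tagged cost export; the service names are made up for illustration.

```python
from collections import defaultdict


def totals_by_env_and_service(line_items):
    """Sum cost per (environment, service) and sort each environment's
    services from most to least expensive."""
    totals = defaultdict(lambda: defaultdict(float))
    for environment, service, cost in line_items:
        totals[environment][service] += cost
    return {
        environment: sorted(services.items(), key=lambda kv: kv[1], reverse=True)
        for environment, services in totals.items()
    }


# Hypothetical line items for one billing period.
line_items = [
    ("dev", "security-scanning", 900.0),
    ("dev", "storage", 120.0),
    ("dev", "security-scanning", 300.0),
    ("prod", "compute", 5000.0),
]
for environment, services in totals_by_env_and_service(line_items).items():
    print(environment, services)
```

With the largest totals at the top of each environment, a charge like heavy security scanning in dev stands out immediately.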

Finally, anomalies can also highlight bugs in your cloud providers’ billing algorithms. It shouldn’t be surprising that bugs can happen everywhere. In these cases, I’ve found cloud providers to be at first very skeptical. However, after a substantial amount of analysis and due diligence on my part, I have found them agreeable to credit back the errant charges.

Under-used resources

Many FinOps tools highlight their “right-sizing” analysis features. Typically the analysis is focused on VM sizing, storage tier changes, and database tier changes. While I find this analysis insightful, I have found it to be an overhyped feature that creates false expectations for larger organizations. For large cloud environments, the tool provides a massive list of cloud resources coupled with recommendations of what the tool estimates the resources should be scaled down to.

The right-sizing tool user then mass disseminates these recommendations to individuals who know something about these resources. The expectation is that swift action will follow. In my experience, this strategy is never that streamlined, and it yields little to no action. The challenges, delays, and reasons for inaction include: (1) identifying the right decision maker and getting the information to them, (2) researching how the change might impact expected performance and what the consequences would be [for dependent systems], (3) the length of time it takes to research, schedule, and implement a change and to build a roll-back strategy, (4) the fear of negative consequences from an outage, a reduced SLA, or something else, and (5) competing priorities of the resource owner.

I recommend using this analysis as a general guide of where to investigate. I look at the top 20 most substantial recommendations and schedule some time to discuss with the resource owner. I then repeat with the next 20. I have found email discussions around these topics to be highly ineffective.
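
Working through recommendations in batches of 20 is easy to script once the tool's output is exported. In this sketch the `estimated_monthly_savings` field name is a hypothetical stand-in for whatever savings estimate your right-sizing tool reports.

```python
def review_batches(recommendations, batch_size=20):
    """Rank recommendations by estimated savings and split them into review batches."""
    ranked = sorted(
        recommendations,
        key=lambda rec: rec["estimated_monthly_savings"],
        reverse=True,
    )
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]


# Hypothetical export of 45 recommendations.
recommendations = [
    {"resource": f"vm-{i}", "estimated_monthly_savings": float(i)} for i in range(45)
]
batches = review_batches(recommendations)
print([len(batch) for batch in batches])
```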

What I have found to be more productive is to make bulk changes in the non-client-facing environments–such as dev and QA. I have had good success with sending out a mass mailing stating that our team will be converting all premium tier (SSD) storage to a standard tier (HDD) in dev and QA. Respondents have 30 days to notify our team of any exceptions they are seeking. If teams experience negative performance with these changes, you need to have a plan in place to rapidly switch them back. Following storage changes, I suggest conservatively downsizing the most impactful VMs guided by the right-sizing analysis. Then, I recommend moving on to databases.

With your experience right-sizing dev and QA, you are better informed to contemplate if similar, but more conservative, action is possible in client-facing environments.

Architecture

Architectural changes are almost always the most time-intensive savings mechanism to implement. An architectural change frequently involves a lengthy process to identify the right architecture that maximizes transactional speed, implementation time, and fiscal efficiency while minimizing risk and operational oversight. On top of that, you have to implement the change. From a FinOps perspective, I generally limit architectural discussion to two themes: (1) responding to massive cost overruns and (2) organizational recommendations [advisory, not restrictive] on specific cloud services/patterns to use and others to avoid.

When organizations are monitoring their cloud costs, it is generally easy to spot an out-of-control process. I’ve seen more than my share of processes that were expected to cost ~$1k/month skyrocket to over $100k/month. Especially for production, I recommend you treat this as a FinOps incident and follow the same procedures you follow for normal incidents. I recommend that a SWAT team be assembled to quickly triage and mitigate the overrun and to establish a longer-term plan to prevent future occurrences. Typically this involves some degree of re-architecture and organizational learning. This learning should be fed back into the organization’s architectural recommendations.
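
One way to catch a runaway process before the month closes is to project month-to-date spend to month end and compare it with the expected monthly cost. The sketch below uses a naive linear projection; the function names and the 3x trigger multiplier are assumptions to adapt, not an established standard.

```python
import calendar
from datetime import date


def projected_month_end(month_to_date_cost: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend across the whole month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date_cost / today.day * days_in_month


def is_finops_incident(month_to_date_cost: float, expected_monthly_cost: float,
                       today: date, multiplier: float = 3.0) -> bool:
    """Flag when projected spend exceeds the expected cost by the given multiple."""
    projected = projected_month_end(month_to_date_cost, today)
    return projected > expected_monthly_cost * multiplier


# A process budgeted at $1k/month that has already spent $33k by June 10.
print(projected_month_end(33_000.0, date(2024, 6, 10)))
print(is_finops_incident(33_000.0, 1_000.0, date(2024, 6, 10)))
```

A linear projection overstates costs that are front-loaded in the month, so treat the flag as a prompt to triage rather than as proof of an overrun.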

There are limitless ways to implement a technology solution in the cloud, especially when you consider that multiple cloud services overlap with each other in their functionality. As such, at the time of construction, a developer might not realize all the various ways a service can be used and how each might impact billing. Your enterprise architects and developers should be aware of the benefits, disadvantages, and optimal techniques when selecting the appropriate cloud resource. The most effective way to disseminate this is to publish a set of generic recommendations [but not requirements] of cloud services to use and cloud services to avoid. It should also include discussions on the fiscal efficiency of the various services and how to optimize them.

Two of the most discussed FinOps architectural patterns are containers and spot instances. From a FinOps perspective, containers provide a commoditized way to effectively pack your applications/services across a set of pooled VMs — such as a Kubernetes cluster.

Theoretically, a fully optimized cluster will cost far less than what it would take to deploy applications to individual servers. As such, many organizations are building infrastructure around massive clusters that host a multitude of different applications as a means to maximize savings. However, be aware that there are security concerns with multi-tenancy. These concerns are tightly coupled to the sensitivity of your organization’s services and data. You should reflect on which container runtime model best balances your organization’s security and fiscal concerns.
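
The savings from packing applications onto pooled VMs can be illustrated with a toy bin-packing example. This sketch uses the first-fit-decreasing heuristic on CPU requests alone; it is not how Kubernetes schedules pods, and it ignores memory, system overhead, headroom, and isolation requirements. The workload sizes and node capacity are hypothetical.

```python
def pack_first_fit_decreasing(cpu_requests, node_capacity):
    """Place each request on the first node with room, largest requests first.

    Returns a list of nodes, each a list of the requests placed on it.
    Requests and capacity share one unit (millicores here).
    """
    nodes = []
    for request in sorted(cpu_requests, reverse=True):
        if request > node_capacity:
            raise ValueError(f"request {request} exceeds node capacity {node_capacity}")
        for node in nodes:
            if sum(node) + request <= node_capacity:
                node.append(request)
                break
        else:
            nodes.append([request])  # no existing node had room; add a new one
    return nodes


# Six hypothetical applications, each of which would otherwise get its own VM.
app_requests = [500, 1500, 2000, 500, 1000, 2500]
packed = pack_first_fit_decreasing(app_requests, node_capacity=4000)
print(len(app_requests), "dedicated VMs vs", len(packed), "pooled nodes:", packed)
```

In this example six workloads fit on two 4-core nodes, which is the theoretical saving; the multi-tenancy concerns above determine how much of it you can actually take.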

Finally, spot instances are excess-capacity cloud VMs that can be purchased at a substantial discount. However, if another customer wants that capacity [at a much higher price], you will have anywhere from 30 seconds to 2 minutes of notice to complete your work before the VM is mercilessly ripped away from you. This is why these VMs are sold at a substantial discount. From a practical standpoint, spot instances are most advantageous if your processes can complete very quickly. There are several vendors that offer orchestration engines around spot VMs. The crux of these tools is their ability to shuffle workloads off the de-provisioning spot VM to a new spot VM or an on-demand VM [if there are no more spot instances available]. Some vendors offer “stateful” functionality that attaches the de-provisioned spot VM’s disks to the replacement VM. These techniques usually revolve around your application persisting its state to disk and then picking up where it left off. Applications with long-running processes will need to optimize their architectures to take advantage of this model.
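
The persist-and-resume pattern can be sketched as a checkpoint file that is updated after each unit of work. Everything here is illustrative: the state layout, the `eviction_requested` callback (a stand-in for however your platform signals an eviction), and the temporary directory, which in practice would be a disk that survives the VM.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Write to a temporary file and rename it, so a partially written
    file never replaces a good checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def process(items, path, eviction_requested=lambda: False):
    """Sum the items, checkpointing after each one. Returns (state, finished)."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        if eviction_requested():
            return state, False  # stop cleanly; progress is already on disk
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
    return state, True


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "job.json")
    work = [1, 2, 3, 4, 5]
    # First run: simulate an eviction notice arriving before the third item.
    signals = iter([False, False, True])
    first_state, first_done = process(work, checkpoint, lambda: next(signals))
    # Second run (the replacement VM): resumes from the checkpoint.
    second_state, second_done = process(work, checkpoint)
    print(first_state, first_done, second_state, second_done)
```

The smaller each unit of work between checkpoints, the less is lost when the notice arrives, which is why long-running processes usually need to be restructured to fit this model.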