Cloud Journey — Part 10 | FinOps

Chris Shayan
18 min read · Feb 22, 2024

Reserved Instances (RIs), Savings Plans (SPs), Committed Use Discounts (CUDs), and Flexible Committed Use Discounts (Flexible CUDs), collectively known as commitment-based discounts, are the most popular and important cost optimizations that cloud service providers offer. This is because commitment-based discounts represent the largest percentage discount you can achieve in cloud and often apply to the largest areas of cloud spend in your bill.

Read on for Part 10 of my Cloud Journey series: FinOps.


Book Recommendation

Cloud FinOps: Collaborative, Real-Time Cloud Financial Management is a great book by J.R. Storment and Mike Fuller. FinOps brings financial accountability to the variable spend model of cloud. Used by the majority of global enterprises, this management practice has grown from a fringe activity to the de facto discipline for managing cloud spend. In this book, authors J.R. Storment and Mike Fuller outline the process of building a culture of cloud FinOps by drawing on real-world successes and failures of large-scale cloud spenders.

Engineering and finance teams, executives, and FinOps practitioners alike will learn how to build an efficient and effective FinOps machine for data-driven cloud value decision-making. Complete with a road map to get you started, this revised second edition includes new chapters that cover forecasting, sustainability, and connectivity to other frameworks.

FinOps Principles

By adhering to these principles, FinOps teams can create a cost-conscious, self-governing culture within their firms that encourages cost responsibility and business agility, allowing them to better control and optimize costs while preserving the cloud’s benefits for innovation and velocity. These values for FinOps are:

(1) Collaboration as a Cross Functional Team:

  • Teams work together in near real time as the cloud operates on a per-resource, per-second basis.
  • Continuously improve for efficiency and innovation.

(2) Decisions are driven by the business value of cloud, not just technology.

  • Unit economic metrics demonstrate business impact better than aggregate spend.
  • Make conscious trade-off decisions among cost, quality, and speed.
  • Think of cloud as a driver of innovation.

(3) Everyone takes ownership of their cloud usage.

  • Accountability of usage and cost is pushed to the edge, with engineers taking ownership of costs from architecture design to ongoing operations.
  • Individual feature and product squads/tribes are empowered to manage their own usage of cloud against their budget.
  • Decentralize the decision making around cost-effective architecture, resource usage, and optimization into tribes.
  • Technical teams must begin to consider cost as a new efficiency metric from the beginning of the software development lifecycle.

(4) FinOps reports need to be democratized.

  • Real-time visibility autonomously drives better cloud utilization.
  • Consistent visibility into cloud spend is provided to all levels.
  • Create, monitor, and improve real-time financial forecasting and planning.
  • Trending and variance analysis helps explain why costs increased.
  • Internal team benchmarking drives best practices and celebrates wins.
  • Industry peer-level benchmarking assesses your company’s performance.

(5) FinOps Chapter.

  • The FinOps chapter encourages, evangelizes, and enables best practices in a shared accountability model, much like security, which has a central team yet everyone remains responsible for their portion.
  • Executive buy-in for FinOps.
  • Rate, commitment, and discount optimization are centralized to take advantage of economies of scale. Remove the need for engineers and operations teams to think about rate negotiations, allowing them to stay focused on usage optimization of their own environments.

(6) Take advantage of the variable cost model of the cloud.

  • The variable cost model of the cloud should be viewed as an opportunity to deliver more value, not as a risk.
  • Embrace just-in-time prediction, planning, and purchasing of capacity.
  • Agile iterative planning is preferred over static long-term plans.
  • Embrace proactive system design with continuous adjustments in cloud optimization over infrequent reactive cleanups.

A New Way of Working Together

The FinOps model requires a cross-functional team that manages the cloud strategy, governance, and best practices and then works with the rest of the business to transform how the cloud is used.

  • Executives (e.g., Head of Infrastructure, Head of Cloud Center of Excellence, CTO, CDO, or CIO) focus on driving accountability and building transparency, ensuring teams are being efficient and not exceeding budgets. They’re also drivers of the cultural shift that helps engineers begin considering cost as an efficiency metric.
  • Engineers and ops team members (such as Lead Software Engineer, Principal Systems Engineer, Cloud Architect, Service Delivery Manager, Engineering Manager, or Director of Platform Engineering) focus on building and supporting services for the organization. Cost is introduced as a metric in the same way other performance metrics are tracked and monitored. Teams consider the efficient design and use of resources via activities such as rightsizing (the process of resizing cloud resources to better match the workload requirements), allocating container costs, finding unused storage and compute, and identifying whether spending anomalies are expected.
  • Finance team members use the reporting provided by the FinOps chapter for accounting and forecasting. They work closely with FinOps practitioners to understand historic billing data so that they can build out more accurate cost models. They use their forecasts and expertise from the FinOps chapter to engage in rate negotiations with cloud service providers.
  • Procurement and sourcing people manage the business’s relationship with vendors, including the cloud service providers. This relationship may be very different from other IT services relationships managed by the company in the past, and it may be more complex and varied to manage. They are interested in getting the best pricing for the commitments the company is making to each vendor, and making sure that the organization receives all the benefits of its patronage and commitments over time.
  • Product team members, including Product Managers, Product Owners, Portfolio Owners, Service Owners, Application Leads, and the like, are responsible for a service, product, or product line. They will work closely with the FinOps team to understand, often in ways they could not when supported in the data centers, the total cost of running the features of their product or application. Product leads are interested in new features to solve customer needs, and will be able to use FinOps-provided data to understand product profitability, direct cost efficiency efforts, and more accurately forecast costs based on new feature releases.
  • FinOps practitioners are the beating heart of a FinOps practice. They understand different perspectives and have cross-functional awareness and expertise. They’re the central team that drives best practices into the organization, provides cloud spend reporting at all the needed levels, and acts as an interface between various areas of the business. They can be cloud-savvy financial leaders or cost-conscious engineers.

This cultural shift also enables those in leadership positions to have input into decision making in a way they currently don’t. Based on leadership input, teams make informed choices about whether they are focused solely on innovation, speed of delivery, or cost of service. Some teams go all-in on one area with a growth-at-all-costs mindset. Eventually the cloud bill gets too big and they have to start thinking about growth and cost together. For example, “Move fast, but keep our cost per customer transaction below $0.45.”

Adoption of FinOps

When proposing the adoption of a FinOps function within an organization, brief a variety of personas among the executive team (engineering leadership, finance leadership, etc.) to gain approval, buy-in, and involvement in conducting FinOps and achieving its goals.

https://data.finops.org/

Each executive team persona is described below, in terms of their goals, concerns, key messaging, and useful KPIs. By understanding the motivations of each executive persona, a FinOps champion will be able to describe the value of FinOps more effectively, minimizing the time and effort to gain alignment. You can read more about FinOps personas at https://www.finops.org/framework/personas/


The FinOps Framework provides the operating model for how to establish and excel in the practice of FinOps. Like FinOps, the Framework is evolving and informed by community experiences, contributions, and conversations. It’s built by the community, for the community. You can read more: https://www.finops.org/framework/

Rate Optimization

Since Cloud Cost = Rates × Usage, this section focuses on the rate half of that equation and covers how to optimize rates so that you pay less for the resources you continue to use. Reservations, Savings Plans (SPs), Reserved Instances (RIs), Committed Use Discounts (CUDs), and Flexible CUDs are the primary levers for adjusting rates for many services, but they can be quite complex.

Comparison of reservation options across the big three cloud service providers

Commitment-Based Discounts

Reserved Instances (RIs), Savings Plans (SPs), Committed Use Discounts (CUDs), and Flexible Committed Use Discounts (Flexible CUDs), collectively known as commitment-based discounts, are the most popular and important cost optimizations that cloud service providers offer. This is because commitment-based discounts represent the largest percentage discount you can achieve in cloud and often apply to the largest areas of cloud spend in your bill.

Some years ago, during a webinar with AWS, Cloudability, and Adobe on the power of RIs, Adobe showed the figure below, indicating that the company had cut its EC2 spending by 60% simply by purchasing RIs.

Commitment-based discounts are most often applied to individual resources in a nondeterministic way. In the case of AWS SPs and Azure SPs, they are applied where they will have the biggest savings, but you can’t pick which specific resources they will be applied against. Google Cloud lets you choose how to attribute the discount credits and fees. They give you three options: Unattributed, Proportional Attribution, or Prioritized Attributions. The right attribution model to choose depends on whether your organization purchases and manages CUDs in a centralized or decentralized way. An analogy might be useful here to help you better understand reservations or commitments.

Say a specific restaurant is running a deal where you buy a book of coupons. Each coupon gives you a meal at that specific restaurant. The book contains one coupon for every day of the month. When you eat at the specified restaurant, you pay for the meal with the coupon. Deciding to eat somewhere else means that you forfeit that day’s coupon and pay full price on the meal at another establishment. Let’s say the book costs $750 and contains 30 meal coupons, where each coupon gets you a meal that, if bought without a coupon, would cost $50. Divided out, that’s $25 per coupon for a $50 meal, saving you $25 a day. If you eat at this restaurant every day, you save 50%, and if you eat there only half of the days, you’ve saved nothing. If you use more than half of the coupons, you’re better off buying the book of coupons.
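The coupon arithmetic above can be checked with a few lines of Python. This is only a sketch of the analogy: the prices come from the example, while the function names are made up for illustration.

```python
def coupon_book_savings(book_price, meal_price, meals_eaten):
    """Savings (positive) or loss (negative) versus paying full price
    for the meals actually eaten at the restaurant."""
    return meal_price * meals_eaten - book_price


def break_even_meals(book_price, meal_price):
    """Number of meals at which the coupon book has paid for itself."""
    return book_price / meal_price


BOOK_PRICE = 750  # price of the coupon book in the analogy
MEAL_PRICE = 50   # full price of one meal
COUPONS = 30      # one coupon per day of the month

for meals in (COUPONS, 15, 10):
    saved = coupon_book_savings(BOOK_PRICE, MEAL_PRICE, meals)
    print(f"{meals} meals eaten: net result {saved} dollars")

print("break-even at", break_even_meals(BOOK_PRICE, MEAL_PRICE), "meals")
```

Eating all 30 meals saves $750 of a $1,500 full-price bill (50%), 15 meals breaks even, and anything fewer is a loss. The same shape applies to a reservation: utilization above the break-even fraction is what makes the commitment pay off.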

If you apply this idea to RIs, once you’ve decided on the length of time you want to reserve, you purchase a reservation (book of coupons) from the cloud service provider, matching a particular resource type and region (specific restaurant meal at a certain location). This reservation will allow you to run the matching resource every hour (or second or millisecond). If you don’t run any resources matching the reservation, you forfeit the savings. As long as you have enough resource usage during the reservation term, you benefit from the discounts — and you save money.

The key takeaways here are:

  • You pay for the commitment-based discount, whether it’s applied to a resource or not.
  • Reservations cost money, but they offset the cost of resources in your account.
  • You do not need to utilize a commitment-based discount fully to save money.

Comparison of reservation offerings across the top three cloud service providers

More recently, AWS and Azure have offered On-Demand Capacity Reservations (ODCRs), which allow you to perform capacity reservations separately from RIs. Google Cloud offers a similar concept, reservations of Compute Engine zonal resources, which provide a very high level of assurance in obtaining capacity for a specific VM type. Note that when you’re using on-demand capacity reservations in combination with AWS RIs, it’s essential to set your RIs as regional so the RIs can discount the capacity reservation.

People commonly misunderstand how RI sharing works. To ensure you get it right, take a look at the example below:

Discount interactions with RI sharing

In the example layout in the figure above, the following may occur:

  • RIs in the management account can apply to account A or B because they both have RI sharing enabled.
  • Account A can benefit both from the RIs purchased within the account and from RIs in the management account but will have preference for the RIs within the account.
  • Account B can benefit from RIs from both account A and the management account.
  • Because account C has RI sharing disabled, the RIs within that account will not apply a discount outside of account C and will go unused when no usage matches.
  • Account D will not receive any RI discounts because it has RI sharing disabled and does not have any RIs purchased within the account.
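The sharing rules in the list above can be modeled in a few lines. This is a simplified sketch, not how AWS billing is implemented: the `Account` class and `ri_sources` function are hypothetical, and it assumes the management account has sharing enabled and that account B owns no RIs of its own (the example does not say either way).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    name: str
    sharing_enabled: bool
    owns_ris: bool


def ri_sources(target, accounts):
    """Names of accounts whose RIs may discount usage in `target`.

    The target's own RIs are listed first, reflecting the preference
    for RIs purchased within the account.
    """
    sources = []
    if target.owns_ris:
        sources.append(target.name)
    if target.sharing_enabled:
        sources += [
            a.name
            for a in accounts
            if a.name != target.name and a.sharing_enabled and a.owns_ris
        ]
    return sources


# Layout from the example; B owning no RIs is an assumption.
management = Account("management", sharing_enabled=True, owns_ris=True)
account_a = Account("A", sharing_enabled=True, owns_ris=True)
account_b = Account("B", sharing_enabled=True, owns_ris=False)
account_c = Account("C", sharing_enabled=False, owns_ris=True)
account_d = Account("D", sharing_enabled=False, owns_ris=False)
ACCOUNTS = [management, account_a, account_b, account_c, account_d]

for account in ACCOUNTS:
    print(account.name, "can be discounted by RIs from", ri_sources(account, ACCOUNTS))
```

Account C only ever sees its own RIs, and account D sees none, which matches the last two bullets.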

EC2 RIs are purchased to match a particular instance size (small, medium, large, xlarge, etc.). The RI will apply discounts to resources matching its size. However, for regional Linux/UNIX RIs, you benefit from a feature called instance size flexibility (ISF). Reservations you currently own or are planning to buy with attributes of Linux, regional, and shared tenancy will automatically be ISF RIs.

ISF allows the RI to apply discounts to different size instances in the same family (m5, c5, r4, etc.). A single large RI can apply discounts to multiple smaller instances, and a single small RI can apply a partial discount to a large instance. ISF gives you the flexibility to change the size of your instances without losing the discount applied by an RI. Because you don’t need to be specific about the exact size of the RI for it to cover all your different size instances, you can bundle up your RI purchases into small variations of parameters.

Conversion of AWS instance sizes to a normalized size with size flexibility

The figure above illustrates a way of thinking about how ISF can be applied. Each column represents the instances that were run in a given hour. For r5 and c5 instances, large (L) is the smallest instance size. If you aim for 100% RI utilization in this example, you’d purchase 29 large RIs. Think of it as purchasing LEGO blocks at the smallest size within a family and combining them to cover the larger instance sizes you are actually running. This means you purchase only one size of RI, well matched to your overall usage. If instance sizes fluctuate but normalized usage stays the same or increases overall, your RIs will remain perfectly utilized.

Normalization factors for AWS reservations

The normalization factors in the table above show how to convert between instance sizes within a family. For example, if you own an RI that’s 2xlarge (16 units), that RI could apply instead to two xlarge (8 units each) or four large (4 units each) instances. You could also use an RI that is 2xlarge (16 units) to apply to one xlarge (8 units) and two large (4 units each) instances.
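As a rough illustration of the conversion, the sketch below turns a mix of running instances into normalized units and then into a count of smallest-size RIs. The factors for large, xlarge, and 2xlarge are the ones quoted above; the 4xlarge value of 32 is assumed from the same doubling pattern, and the function names are hypothetical.

```python
import math

# Units per instance size within one family.
NORMALIZATION_FACTOR = {
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
    "4xlarge": 32,  # assumption: continues the doubling pattern
}


def normalized_units(running):
    """Total normalized units for a {size: instance_count} mapping."""
    return sum(NORMALIZATION_FACTOR[size] * count for size, count in running.items())


def ris_needed(running, ri_size="large"):
    """How many RIs of `ri_size` cover the running instances in full."""
    return math.ceil(normalized_units(running) / NORMALIZATION_FACTOR[ri_size])


# One xlarge plus two large instances use exactly one 2xlarge RI.
mix = {"xlarge": 1, "large": 2}
print(normalized_units(mix), "units")

# A hypothetical hour of usage in a single family.
hour = {"2xlarge": 1, "xlarge": 2, "large": 3}
print(ris_needed(hour), "large RIs would cover this hour")
```

Buying at the smallest size keeps the purchase to a single RI type, as long as total normalized usage stays at or above what was bought.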

Some instance families do not have all the sizes of instances, so, if purchasing ISF RIs, just look for the smallest instance size in the family.

In late 2019, Amazon Web Services announced Savings Plans, initially offering discounts on EC2 instances, Lambda, and Fargate, the managed service of Amazon’s proprietary container service. Since the initial release, AWS has added an additional SP offering that applies to the AWS machine learning service SageMaker. The addition of the machine learning SP has driven up demand from AWS customers for AWS to release further SPs to cover other services, such as database and storage. Based on this customer demand, we expect additional offerings to be announced after this edition of the book is released. A large amount of what you learned here about RIs applies to SPs as well. AWS has continued the purchasing options of All upfront, Partial upfront, and No upfront payment and one- and three-year durations.

However, the biggest difference between SPs and RIs is that while RI commitments are for resource units (numbers of EC2 instances of certain sizes), SP commitments are monetary (the amount a customer is committing to spend on discounted compute or other covered services). SPs are offered in three plan types:

Compute SPs

Apply broadly across EC2, Lambda, and Fargate compute. This plan type applies more widely than a Convertible RI (CRI), covering compute resources in any region, which lowers the effort needed to maintain high plan utilization. Compute SPs generally return the same savings rate as CRIs.

EC2 Instance SPs

Apply to EC2 usage of a single family in a single region in any size or other configuration. While this plan type is more restrictive than a Compute SP, it offers higher discounts. It is less restrictive than a Standard RI (SRI) while providing the same discount as SRIs.

Machine Learning SPs

Sticking with the cost/hour model of SPs, the machine learning SP for SageMaker enables AWS customers to commit to spend on SageMaker in return for reduced rates on component costs of SageMaker, including those on:

  • Studio and On-Demand Notebooks
  • Processing
  • Data Wrangler
  • Training
  • Real-Time Inference
  • Batch Transform

There are, however, a few major differences between RIs and SPs:

  • Both SP types that discount EC2 instances offer more flexibility in the compute they apply discounts to compared to RIs, removing the need to consider tenancy, operating system, or instance size. Compute SPs also remove the need to consider instance family. Most importantly, Compute SPs apply across all regions, which will significantly improve how much of your usage is coverable.
  • With RIs, you purchase a number of reservations that match instances in your accounts. SPs are purchased as a committed hourly spend. The spend amount that you commit to is post-discount. For example, if you’re running $100/hour of on-demand EC2 and an SP offers a 50% discount, you need to purchase a $50/hour SP.
  • Unlike RIs, SPs apply discounts to AWS Fargate and Lambda usage. This is the first AWS offering to discount usage across multiple service offerings. Because these other services are discounted at very different discount rates, this also can affect the overall benefit you receive from an SP.
  • There is no equivalent RI offering for AWS SageMaker.

AWS will apply SP coverage to the resources that give you the greatest discount, which is nice because not all types of compute or instances are discounted by the same amount. This also means that, as you get more and more coverage, the increase in discount percentage you are receiving will get smaller.
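To make the post-discount commitment and the "greatest discount first" behavior concrete, here is a minimal sketch. It is not AWS's actual billing algorithm: the function name is hypothetical, the discount rates are illustrative, and usage is treated as a constant hourly amount.

```python
def apply_savings_plan(usage, commitment_per_hour):
    """Apply an hourly SP commitment to the highest-discount usage first.

    usage: list of (name, on_demand_per_hour, discount_rate) tuples.
    Returns (total_hourly_cost, unused_commitment).
    """
    remaining = commitment_per_hour
    uncovered_on_demand = 0.0
    # Highest discount rate first, so each committed dollar saves the most.
    for _name, on_demand, discount in sorted(usage, key=lambda u: u[2], reverse=True):
        discounted_cost = on_demand * (1 - discount)  # SP spend needed to cover it all
        sp_spend = min(remaining, discounted_cost)
        covered_fraction = sp_spend / discounted_cost if discounted_cost else 1.0
        remaining -= sp_spend
        uncovered_on_demand += on_demand * (1 - covered_fraction)
    # The commitment is paid in full whether or not it is fully used.
    return commitment_per_hour + uncovered_on_demand, remaining


# The example from the list above: $100/hour on demand at a 50% discount
# is fully covered by a $50/hour commitment.
print(apply_savings_plan([("ec2", 100.0, 0.5)], 50.0))

# Mixed services with illustrative rates: the 50% usage is covered first,
# and the leftover commitment only partly covers the 25% usage.
mixed = [("fargate", 40.0, 0.25), ("ec2", 60.0, 0.5)]
print(apply_savings_plan(mixed, 45.0))
```

In the mixed case the first $30 of commitment removes $60 of on-demand spend, while the last $15 removes only $20, which is the diminishing return described above.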

Azure SPs operate as discounts to usage hours over a period of time. You save money by committing to a fixed hourly cost on compute services over a one- or three-year term. Azure advertises savings of up to 65% from pay-as-you-go prices.

Let’s look at how CUD billing and application work within the Google concept of organizations, folders, and projects. The following figure shows the structure of the hierarchy.

There are similarities to AWS’s concept of a management account structure. A billing account can share CUDs across all of its connected projects, once you’ve enabled CUD sharing on the account. This means CUDs offer both the flexibility of applying to a variety of machine types and the flexibility of applying to multiple projects. Unlike AWS, where RIs are automatically shared across multiple linked accounts, CUDs are shared only after you’ve enabled the feature for your billing account.

This gives you the flexibility to choose what’s most important for your account: cleaner chargeback or maximum waste reduction. If you want to allocate funds from a particular project to make the commitment, turning CUD sharing off will offer cleaner chargeback options than the way AWS handles RIs. However, turning CUD sharing on could result in less waste because unused commitments will be shared across multiple projects.

CUDs give you the option to choose the sharing model that works best for your organization. For an organization with a lot of projects, turning sharing on allows you to take advantage of the economies of scale of the entire organization.
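The trade-off between sharing on and sharing off can be sketched numerically. This is a deliberate simplification: it assumes all commitments and usage are interchangeable units (same machine family and region), and the project names and numbers are invented for illustration.

```python
def unused_commitment(projects, sharing_enabled):
    """Committed units left unused in a given hour.

    projects: {project_name: (committed_units, used_units)}.
    With sharing off, each project's commitment covers only its own usage.
    With sharing on, commitments are pooled across the billing account.
    """
    if sharing_enabled:
        total_committed = sum(committed for committed, _ in projects.values())
        total_used = sum(used for _, used in projects.values())
        return max(0, total_committed - total_used)
    return sum(max(0, committed - used) for committed, used in projects.values())


# Hypothetical projects: one over-committed, one with no commitment at all.
projects = {
    "checkout": (10, 6),
    "search": (0, 5),
}

print("sharing off:", unused_commitment(projects, sharing_enabled=False), "units wasted")
print("sharing on: ", unused_commitment(projects, sharing_enabled=True), "units wasted")
```

With sharing off, the four spare units in the first project go to waste but the chargeback is clean. With sharing on, the second project's usage absorbs them.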

Steps to Building a Commitment-Based Discount Strategy

There are six key steps in building your first commitment strategy:

  1. Learn the fundamentals of each program.
  2. Understand your level of commitment to your cloud service provider.
  3. Build a repeatable commitment-based discount process.
  4. Purchase regularly and often.
  5. Measure and iterate.
  6. Allocate up-front commitment costs appropriately.

The figure below shows the accumulated cost of a resource, at both the on-demand rate and the commitment rate, over a one-year term. In the example, the commitment costs over $300 up front, which is why the commitment line starts above $300. If you had used a No upfront commitment, then it would begin below the on-demand line. And if you used an All upfront commitment, it would be drawn as a flat line at the value of the up-front cost.

The cash flow break-even point is the date on which the commitment has cost you the same amount as if you had been using on-demand rates. Some people call the cash flow break-even point the crossover point, for obvious reasons. You’re no longer out-of-pocket for the commitment at the cash flow break-even point. However, you will continue to pay the ongoing (hourly) costs of the commitment. If you stop your usage at the cash flow break-even point, it will result in you losing money versus on-demand rates.

The total committed cost break-even point is where the on-demand cost of resource usage reaches the total cost of the commitment. This is the real break-even point, since you could cease to run any usage that matches the commitment and you would be no worse off than if you had not committed to the discount program: you would have paid the same amount on-demand as with the commitment. After the break-even point, you realize savings from the commitment.

The difference between the on-demand line and the total cost break-even point is the savings (or loss) made by the commitment. It’s essential to understand that with this graph it doesn’t matter whether you run one resource that matches the commitment for the whole 12 months or you run many different matching resources during that period.

If you add up all the usage to which your commitment applies a discount, you can then compare the total cost of the commitment versus what you would have paid for the same amount of usage at on-demand rates. This comparison will give you an indication of when you have met your break-even point and of the amount of savings you have realized.
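The two break-even points described above can be computed directly. The sketch below assumes a Partial upfront commitment with matching usage running every hour; the rates and the up-front fee are hypothetical, chosen only so that the up-front cost is a little over $300 as in the example.

```python
HOURS_PER_YEAR = 8760


def cash_flow_break_even_hours(upfront, commit_hourly, on_demand_hourly):
    """Hours until accumulated commitment cost equals accumulated on-demand cost."""
    if on_demand_hourly <= commit_hourly:
        raise ValueError("commitment rate must be below the on-demand rate")
    return upfront / (on_demand_hourly - commit_hourly)


def total_cost_break_even_hours(upfront, commit_hourly, on_demand_hourly,
                                term_hours=HOURS_PER_YEAR):
    """Hours of usage whose on-demand cost equals the full-term commitment cost."""
    total_commitment_cost = upfront + commit_hourly * term_hours
    return total_commitment_cost / on_demand_hourly


# Hypothetical Partial upfront commitment on a one-year term.
UPFRONT = 350.40          # up-front fee in dollars
COMMIT_HOURLY = 0.04      # hourly fee, charged for every hour of the term
ON_DEMAND_HOURLY = 0.10   # on-demand rate for the same resource

cash_flow = cash_flow_break_even_hours(UPFRONT, COMMIT_HOURLY, ON_DEMAND_HOURLY)
total = total_cost_break_even_hours(UPFRONT, COMMIT_HOURLY, ON_DEMAND_HOURLY)

print(f"cash flow break-even after about {cash_flow:.0f} hours")
print(f"total committed cost break-even after about {total:.0f} hours")
```

With these numbers the cash flow break-even arrives well before the total committed cost break-even. Stopping usage between the two still loses money, because the hourly fee keeps accruing for the rest of the term.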

Cloud is designed for just-in-time purchasing. You shouldn’t be running infrastructure or purchasing capacity (i.e., buying commitments) in a data center fashion

Years ago, AWS had graphs like those above on their home page. The graph on the left showed the way you had to buy capacity in the data center and hardware world, fraught with constraints and long lead times for hardware. The graph on the right shows the ideal way to run infrastructure in the cloud: you spin it up just when you need it. The same is true for purchasing commitments.

FinOps Certified Platform

FinOps Certified Platform is a tier of technology providers that license and deliver a software product, or are founders/maintainers of an open-source project, to help people successfully adopt cloud financial management practices aligned with the FinOps standards.

Anodot seamlessly combines all of your cloud spend into a single platform. Monitor and optimize your cloud cost and resource utilization across AWS, GCP, and Azure. Deep dive into your data details and get a clear picture of how your infrastructure and economies are changing.

Founded in 2007, Apptio is the leading provider of cloud-based IT financial management software. Our applications connect technology investments to business priorities, engage business stakeholders to drive cross-functional accountability, and improve the efficiency of hybrid IT resources. Cloudability by Apptio is the original FinOps solution.

Simba Innovation is a global Next Generation Cloud Managed Services and Cloud Native Services Provider. Simba Innovation partners with AWS, Azure, and other public cloud providers and focuses on helping customers with high-level global services, such as cloud consulting, migration and deployment, and professional services for cloud.

Designed and founded by engineers in 2015, Yotascale optimizes the world’s cloud computing spend, making cloud computing profitable and sustainable, for every organization. It creates cloud cost visibility and enables resource transparency by empowering engineering teams. Yotascale’s next-generation cloud cost management solution identifies resource waste, enables cross-functional collaboration, improves optimization by 5x, and reduces yearly costs by 50%.

You can find more tools and vendors here: https://www.finops.org/certifications/finops-certified-platform/


Chris Shayan

Purpose-Driven Product Experience Architect. I’m committed to Purpose-Driven Product engineering. My life purpose is Relentlessly elevating experience.