November 29, 2017

Why Isn’t My Cloud Raining Money?

It’s a common belief that organizations can save money by moving their products “To the Cloud!” (one of my most hated catchphrases). However, once they migrate their on-premises infrastructure up to the cloud, many find that the monthly costs are significantly higher than they expected, or at least no lower than they were before making the move.

This inevitably devolves into an argument of “Why did we do this?” or more commonly “You must be doing this wrong.” The truth of the matter is, of course, not as simple or straightforward as that. I will focus primarily on AWS, but most of these behaviors apply to all public cloud providers in one way or another.

Some of the most common causes of unexpected AWS spend fall into the following buckets:

  • Treating the cloud like a virtual data center
  • Unforeseen service usage or cost
  • Not utilizing cloud-native features

Let’s go into each of these a little deeper:

Treating the cloud like a virtual data center

It’s very common as a first venture into the cloud to perform a ‘lift-and-shift’ or ‘forklift’ migration. In other words, make your cloud infrastructure match your current physical data center infrastructure. This strategy minimizes much of the risk of changing data centers – I often recommend that it be seen as step 1, not the final step, of migration. The draw of moving your capital expense to an operational expense may obscure the fact that you are moving from an ownership model to a lease model. As anyone will tell you (besides a car salesperson), leasing resources almost always costs more over time than owning those resources outright. The three primary reasons products move to the cloud, besides a C-level strategy consisting entirely of “get to cloud,” are capacity, elasticity, and geography. Elasticity is key to cost control – instead of leasing the resources, use them more like a ride-share service: use the capacity that you need, and then release it.

Another significant cost-increasing behavior is in the area of recovery. In a cloud environment, you replace or recreate resources instead of repairing them. Decades-old practices that were once very much needed, such as nightly backups of every server in a product architecture, aren’t necessary anymore; with the cloud’s capacity and elasticity, you can recreate a machine instead of repairing it. So re-evaluate your backup strategy and look for places to cut costs. It can also provide some additional security: once you can automatically recreate your servers, you can periodically destroy and rebuild them to ensure that no malware, unauthorized configuration, or settings drift has crept in since the server was deployed.
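
For example, here’s a minimal sketch (in Python with boto3) of recycling a server that belongs to an Auto Scaling group: terminate one instance without lowering the desired capacity, and the group launches a clean replacement from its launch configuration. The group name is a hypothetical placeholder.

    import boto3

    autoscaling = boto3.client('autoscaling')

    def recycle_one_instance(group_name='my-app-asg'):  # hypothetical group name
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name])['AutoScalingGroups'][0]
        instance_id = group['Instances'][0]['InstanceId']
        # Terminate without decrementing desired capacity; the group replaces the instance.
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=False)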

The cloud is often used as an offsite failover site. With its geographical distribution, it’s an ideal means of providing a reliable failover or redundant site. In traditional data centers, you would keep a production-worthy (or at least something close to production-worthy) set of hardware standing by so that, should failover occur, your customers’ needs would still be met. In the cloud, take a ‘pilot light’ approach – keep the failover site running, but at absolute minimum capacity. Have controls, whether automated or manual, that ‘up-size’ the failover site only when needed. Why pay for massive horsepower before you need it?
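
As a rough sketch of that ‘up-size when needed’ control, assuming the failover site runs behind an Auto Scaling group (the group name, region, and sizes here are illustrative), a failover script could simply raise the group’s capacity:

    import boto3

    def scale_up_failover_site():
        autoscaling = boto3.client('autoscaling', region_name='us-west-2')  # assumed failover region
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName='failover-web-asg',  # hypothetical group
            MinSize=4,
            DesiredCapacity=4,
            MaxSize=8)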

Unforeseen service usage or cost

A very common occurrence: even though you planned out your like-for-like server sizing, the first month’s bill arrives with significant sticker shock. This comes from a number of sources, but the most common are:

Support

I highly recommend at least Business-level support for all accounts. Without it, you won’t even be able to open tickets if you have any issues (spoiler alert: you will). Support is charged as a percentage of your overall AWS spend. It’s a tiered percentage (the rate drops as your total goes up), but it can represent anywhere from 3-10% of your monthly bill. The upside is that as you lower your other spend, the savings ‘double dip’ by lowering your support bill as well.
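
To see how the tiering plays out, here’s a back-of-the-envelope sketch in Python. The tier boundaries and rates are illustrative assumptions based on the Business tier as I recall it at the time of writing; check the current AWS Support pricing page for the real figures.

    def business_support_charge(monthly_spend):
        # Illustrative tiers: 10% of the first $10k, 7% to $80k, 5% to $250k, 3% beyond.
        tiers = [(10000, 0.10), (80000, 0.07), (250000, 0.05), (float('inf'), 0.03)]
        charge, previous_cap = 0.0, 0.0
        for cap, rate in tiers:
            if monthly_spend > previous_cap:
                charge += (min(monthly_spend, cap) - previous_cap) * rate
            previous_cap = cap
        return max(charge, 100.0)  # assumed monthly minimum

    # Cutting usage from $50k to $40k also trims roughly $700 off the support line item.
    print(business_support_charge(50000) - business_support_charge(40000))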

‘Leaving the basement lights on’

If you are using AWS for your development, testing, or CI pipeline, consider simply turning off your instances in the overnight hours. Use a scheduled CloudWatch event to turn off servers, and another to turn them back on. A stopped instance will still cost you for the attached storage, but the compute charge will be zero. Be sure to ‘stop’ instead of ‘terminate.’ In a similar vein, CloudWatch Logs, by default, persist forever – and while they are not particularly expensive, the cost will add up over time, so review your log streams periodically and purge unneeded ones, or set a retention policy so that old events expire and are removed.
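
Here’s a minimal sketch of the ‘lights out’ Lambda that a scheduled CloudWatch Events rule could invoke each evening; the tag name and value and the 30-day retention period are assumptions, and a mirror-image function would start the instances in the morning.

    import boto3

    ec2 = boto3.client('ec2')
    logs = boto3.client('logs')

    def handler(event, context):
        # Stop (not terminate) any running instance tagged for overnight shutdown.
        reservations = ec2.describe_instances(
            Filters=[{'Name': 'tag:Schedule', 'Values': ['office-hours']},  # hypothetical tag
                     {'Name': 'instance-state-name', 'Values': ['running']}])['Reservations']
        instance_ids = [i['InstanceId'] for r in reservations for i in r['Instances']]
        if instance_ids:
            ec2.stop_instances(InstanceIds=instance_ids)

        # Keep CloudWatch Logs from piling up forever (first page of log groups only).
        for group in logs.describe_log_groups()['logGroups']:
            logs.put_retention_policy(logGroupName=group['logGroupName'], retentionInDays=30)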

“But it’s just a GB…”

Many of the services in the cloud, particularly storage, are priced at a very small unit – the GB. However, once you have virtually endless capacity, it is incredibly easy to store far more data than you initially expected. This is particularly impactful with EFS and S3, where there is virtually no upper limit. In other words, storage may be cheap, but data is endless. Be sure to have a removal policy in place for your S3 buckets, and review the billing amounts for these potentially boundless data pits mid-month (or weekly).
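
A removal policy can be as simple as an S3 lifecycle expiration rule. A minimal sketch, assuming a hypothetical bucket and prefix:

    import boto3

    s3 = boto3.client('s3')
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-app-data',                       # hypothetical bucket
        LifecycleConfiguration={'Rules': [{
            'ID': 'expire-temp-exports',
            'Filter': {'Prefix': 'exports/temp/'},  # hypothetical prefix
            'Status': 'Enabled',
            'Expiration': {'Days': 30},             # remove objects 30 days after creation
        }]})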

Over-provisioning

It’s very comforting to like-size your AWS resources to your on-premises resources. But remember, you sized your on-premises resources based on many factors besides actual compute needs – lease agreements, procurement time, 2-3 year plans, and so on all played a part. Periodically review the performance of your resources, and if the average usage (be it CPU, memory usage, or response time) is well below the requirements, consider resizing your machines. It’s a five-minute effort, and it may cut your bill down significantly across the entirety of your product infrastructure. You can always resize again if you’ve over-corrected.
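
The resize itself really is just a few API calls. A minimal sketch, with a hypothetical instance ID and an assumed smaller target type (the instance must be stopped to change its type):

    import boto3

    ec2 = boto3.client('ec2')
    instance_id = 'i-0123456789abcdef0'  # hypothetical instance

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  InstanceType={'Value': 'm4.large'})  # assumed smaller size
    ec2.start_instances(InstanceIds=[instance_id])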

Another area that brings unexpected cost is DynamoDB’s pricing model – you pay for provisioned throughput per table, and you pay for that throughput whether you use it or not. Therefore, you should conduct a regular review of your CloudWatch metrics to ensure that your actual throughput numbers match your provisioned numbers. The subtleties of DynamoDB’s auto-scaling feature are something to discuss in more detail in another post.
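
A minimal sketch of that review, comparing a table’s provisioned read capacity with the peak consumption over the last day (the table name is a hypothetical example; writes work the same way with the write-capacity metric):

    from datetime import datetime, timedelta
    import boto3

    table_name = 'orders'  # hypothetical table
    dynamodb = boto3.client('dynamodb')
    cloudwatch = boto3.client('cloudwatch')

    provisioned = dynamodb.describe_table(
        TableName=table_name)['Table']['ProvisionedThroughput']['ReadCapacityUnits']

    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/DynamoDB',
        MetricName='ConsumedReadCapacityUnits',
        Dimensions=[{'Name': 'TableName', 'Value': table_name}],
        StartTime=datetime.utcnow() - timedelta(days=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Sum'])

    # The metric is a sum per period; divide by the period length to get units per second.
    peak_consumed = max((p['Sum'] / 300 for p in stats['Datapoints']), default=0)
    print('provisioned: %d, peak consumed: %.1f' % (provisioned, peak_consumed))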

‘Lifecycle’ changes

The final quick item to look at that may increase your AWS bill is actually something that most people (myself included) recommend to lower costs. S3 supports ‘lifecycle’ rules – moving objects from ‘standard’ to ‘infrequent access’ (IA) to ‘glacier’ a number of days after they are uploaded. This can be a very efficient way to save money on files that are kept around for a long period of time. However, a critical point to remember is that while the long-term storage costs are much lower (about half for IA, almost ⅙ for glacier), there is a charge per GB retrieved. So, the larger the file, the more months it needs to sit in IA before a single retrieval stops being a net cost increase.

Also note that S3 lifecycle transitions are driven only by the creation or last-modified date, not by when the file was last accessed. If you need to ‘keep alive’ a file in standard storage, you will need to copy the file onto itself to update the last-modified date. For other cases, size your files so that a retrieval pulls just the data you need, keeping the per-GB charges to a minimum.
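
The ‘copy onto itself’ trick looks like the sketch below; the bucket and key are hypothetical, and S3 requires that something change on a self-copy, hence the replaced metadata.

    import boto3

    s3 = boto3.client('s3')
    bucket, key = 'my-app-data', 'reports/2017-q3.csv'  # hypothetical object

    # Copying the object onto itself resets its last-modified date, which the
    # lifecycle rules key off of, keeping it in standard storage a while longer.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={'Bucket': bucket, 'Key': key},
        MetadataDirective='REPLACE',
        StorageClass='STANDARD')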

Not utilizing cloud-native features

While some of this area is covered above, there are cloud-native features that can be leveraged to easily lower your costs. Some of these are pretty straightforward to implement, but expect some development cost the more deeply you leverage cloud capabilities.

Leverage autoscaling

The first is leveraging autoscaling – if your product has a tier of application servers operating in a group, then one of the easiest ways to reduce costs is to grow and shrink your application server pool. You can do this by CPU usage or other metrics, but sometimes the easiest manner is by schedule. Most products have at least a semi-predictable usage pattern, and you can have AWS add or remove servers based purely on time of day. Be sure to give a buffer on either side, both for startup times and to ensure a positive user experience. To do this successfully, of course, your product will have to be able to handle the sudden creation and removal of servers. Usually this means externalizing session management away from the application servers into a shared store, such as ElastiCache, or into a database.
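
Schedule-based scaling is just a pair of scheduled actions on the Auto Scaling group. A minimal sketch, where the group name, sizes, and cron expressions (evaluated in UTC) are assumptions:

    import boto3

    autoscaling = boto3.client('autoscaling')

    # Grow the pool ahead of business hours...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName='web-app-asg',  # hypothetical group
        ScheduledActionName='business-hours-up',
        Recurrence='0 12 * * 1-5',           # weekdays, with a buffer before peak usage
        MinSize=4,
        DesiredCapacity=6)

    # ...and shrink it overnight.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName='web-app-asg',
        ScheduledActionName='overnight-down',
        Recurrence='0 2 * * *',
        MinSize=1,
        DesiredCapacity=1)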

Leverage different storage options

The second is to properly leverage different storage options and be willing to move data back and forth between them. There are three major storage options in AWS:

  • Elastic File System (EFS),
  • Elastic Block Storage (EBS), and
  • Simple Storage Service (S3).

EFS is the most expensive ($0.30 per GB-month), but can be cross-mounted as a disk volume on multiple Linux servers (Windows support is not available at the time of this writing). You are charged for the data you actually use, on average, so if you need to make a large file temporarily accessible to multiple servers, it may be an ideal choice.

EBS is a virtual hard drive that is mountable to only one machine (and, unlike EFS, it works with Windows), and it only costs $0.10 per GB-month – but that is a GB of capacity, not necessarily usage. If you buy a 500GB EBS volume but only use 1GB, for example, you still pay for 500GB. That said, servers need drives, so there’s a certain inevitability to using EBS. There are two features of EBS that are especially good to know. First, you can now resize an EBS volume in place, so you can size your disks based on short-term needs (within reason) and increase their size as needed; note that you’ll need to update the filesystem on your server to take advantage of the larger drive. Second, you have the ability to ‘pay for performance’ by increasing the IOPS of the drives, but be cautious, because the cost climbs as you get faster.
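
The in-place resize is a single call; a minimal sketch with a hypothetical volume ID and an assumed new size (the filesystem on the instance still needs to be extended afterwards):

    import boto3

    ec2 = boto3.client('ec2')
    ec2.modify_volume(VolumeId='vol-0123456789abcdef0',  # hypothetical volume
                      Size=200)                          # new capacity in GiB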

The last storage option is S3. It’s the least expensive (about $0.02 per GB-month), but note – it is not a file system, but rather an object store. Objects cannot be updated or appended to, only created and destroyed. Nor can it be effectively mounted as a file store (despite the many attempts out there to do so). So why would you use it? Old server logs, static file resources that do not change, and data files you only periodically need on disk are great use cases. You can use S3 as a kind of swap space, paying in data transfer time instead of the financial cost of storage. Since S3 objects can be served up by its built-in web server, you can even host your product’s web assets in a very cost-effective manner. With AWS’s ever-increasing ability to process data directly on S3 (e.g. EMR, Athena, Redshift Spectrum), you may not even have to take your data off of S3 to work with it.

Leverage the ‘serverless’ features

An advanced cost-savings approach is to lean heavily on AWS’s “serverless” features. You do not have to redesign your entire product into microservices to benefit from them. If you have a feature that needs to be available 24/7 and is triggered by an event or a schedule but isn’t constantly being hit, like an API, a nightly job, or other event monitoring, it may be very worthwhile to try leveraging Lambda and/or API Gateway. The charges are primarily ‘per execution’ instead of ‘per hour,’ and the service is highly redundant and automatically scales to load. Plus, there’s the added benefit of not paying the hidden cost of managing, patching, and dealing with extra servers. There are limitations and drawbacks to the technology, but moving from ‘running servers’ to ‘running code’ has many financial benefits, and I highly recommend seeing where it can fit into your product architecture.
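
As a small sketch of the shape this takes: the nightly job below is a Lambda handler that a scheduled CloudWatch Events rule could invoke, instead of a cron job on an always-on server. The bucket and prefix are hypothetical.

    import boto3

    def handler(event, context):
        # A stand-in for whatever the nightly job actually does.
        s3 = boto3.client('s3')
        listing = s3.list_objects_v2(Bucket='my-app-data',  # hypothetical bucket
                                     Prefix='exports/')
        return {'objects_checked': listing.get('KeyCount', 0)}

You pay only for the invocations and their duration, not for the hundreds of hours a month the equivalent server would have been sitting mostly idle.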

Leverage elasticity

Lastly, one of the great features of the cloud is its elasticity: you can constantly adjust. Change server sizes, adjust storage capacity, try a managed service instead of running your own servers. It may sound flippant, but play around. If things aren’t operating as expected, you can always revert. You can also leverage the cloud for ‘crazy’ ideas, since it’s possible to redeploy an entire copy of your system in minutes (or at least hours). Want to try upgrading a major version of your database, or even moving from one database vendor to another? Spin it up, try it out, but be sure to shut it down when you’re done. This may not seem like cost-saving advice, but being able to rapidly prototype significant product changes may let you adopt more cost-effective technologies and different infrastructure approaches, ultimately minimizing time to value for your customers.

Summary

In closing, the biggest control over your cloud spend is your attention. Leverage AWS’s billing console (and its new Cost Explorer API, sketched below) to see ahead of time whether your bill is coming in higher than anticipated. The above suggestions are merely a starting point, and a much deeper analysis can find opportunities to improve savings, performance, and stability. Looking at different cloud technologies and approaches, or even changing some business requirements (can you live with a 10-minute delay to restore infrequently-queried data for reports, if it saves you $10k?), can provide significant savings. Operational costs are the mantra of cloud computing – make sure that you can plan for them, and find ways to keep them in check.
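
Here’s a minimal sketch of pulling month-to-date spend by service through the Cost Explorer API; the dates are illustrative, and note that the API itself carries a small per-request charge.

    import boto3

    ce = boto3.client('ce')
    result = ce.get_cost_and_usage(
        TimePeriod={'Start': '2017-11-01', 'End': '2017-11-29'},  # End date is exclusive
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}])

    for group in result['ResultsByTime'][0]['Groups']:
        print(group['Keys'][0], group['Metrics']['UnblendedCost']['Amount'])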