Introduction

Cloud computing has changed how businesses manage infrastructure, offering elasticity, scalability, and a pay-as-you-go pricing model. However, cloud costs can escalate quickly, particularly in production environments where on-demand instances are often a major expense. Spot Instances (AWS) and Preemptible VMs (Google Cloud) are a cost-efficient alternative for businesses that can tolerate some level of interruption. They are offered at a fraction of the on-demand price, which makes them an attractive option for optimizing cloud costs.

But how do you incorporate these interruptible instances into your production workloads without compromising performance or reliability? In this blog, we will explore the strategies and best practices for safely integrating spot instances and Preemptible VMs into your production environment.

1. Understanding Spot Instances and Preemptible VMs

Spot Instances and Preemptible VMs are essentially spare capacity offered by cloud providers at discounted rates. These instances can be interrupted by the provider with little notice when they need the resources back, making them ideal for workloads that are fault-tolerant or can be easily redistributed.

  • Spot Instances (AWS): AWS offers unused EC2 instances at up to a 90% discount through spot instances. These instances can be terminated by AWS with just a two-minute warning when they are needed for other purposes.
  • Preemptible VMs (Google Cloud): Similarly, Google Cloud offers Preemptible virtual machines at a much lower price point than regular instances. Google shuts these instances down after at most 24 hours, or sooner when the resources are needed, often providing just a 30-second warning.

Both options offer significant savings, but their volatility means they are not suitable for every workload. However, by leveraging fault-tolerant architecture and smart strategies, you can safely integrate them into production without compromising on reliability.

2. Identifying Suitable Workloads

The first step in using spot instances or Preemptible VMs is to identify which parts of your workload can tolerate interruptions. Not all workloads are created equal. Mission-critical services, such as databases, payment gateways, and low-latency applications, should generally not rely on interruptible resources. However, there are many workloads that can safely incorporate these instances:

  • Batch Processing: Workloads like data analysis, machine learning training, and video rendering, which can be broken into smaller tasks, are ideal candidates. Since these tasks can often be retried or restarted without much overhead, they align well with the interruptible nature of spot and Preemptible instances.
  • Microservices and Containerized Applications: In modern cloud architectures, microservices packaged as containers (for example with Docker) and orchestrated by Kubernetes can be a good fit. The ephemeral nature of containers allows for easy scaling, and orchestration tools like Kubernetes can automatically reschedule workloads when spot instances are terminated.
  • Distributed Systems: If your application is already designed to run across multiple instances in a distributed fashion, then incorporating spot or Preemptible instances is relatively straightforward. Well-designed distributed systems tolerate the loss of individual nodes because they rely on multiple nodes to complete tasks.

A well-known example is Netflix, which successfully uses AWS Spot Instances. Their video encoding pipeline, which processes large amounts of data to transcode videos into different formats, runs on a combination of on-demand and spot instances. Since the pipeline can tolerate interruptions without impacting the end-user experience, Netflix saves a significant amount of money by using spot instances for this part of their infrastructure.
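To make the retry-friendly nature of batch work concrete, here is a minimal Python sketch of a queue that puts a task back whenever its worker is interrupted. The names run_batch, Interrupted, and flaky_transcode are hypothetical and chosen for illustration; a real pipeline would typically use a managed queue rather than an in-memory one.

```python
from collections import deque
from typing import Callable, Iterable


class Interrupted(Exception):
    """Raised by a (hypothetical) worker when its instance is reclaimed."""


def run_batch(tasks: Iterable[str],
              process: Callable[[str], str],
              max_attempts: int = 5) -> dict[str, str]:
    """Process every task, re-queuing any task whose worker was interrupted."""
    queue = deque((task, 1) for task in tasks)
    results: dict[str, str] = {}
    while queue:
        task, attempt = queue.popleft()
        try:
            results[task] = process(task)
        except Interrupted:
            if attempt >= max_attempts:
                raise RuntimeError(f"{task} failed after {attempt} attempts")
            # Put the task back so a replacement worker can pick it up.
            queue.append((task, attempt + 1))
    return results


if __name__ == "__main__":
    seen: set[str] = set()

    def flaky_transcode(chunk: str) -> str:
        # Simulate an interruption the first time each chunk is attempted.
        if chunk not in seen:
            seen.add(chunk)
            raise Interrupted(chunk)
        return f"{chunk}.mp4"

    print(run_batch(["chunk-1", "chunk-2"], flaky_transcode))
```

The important property is that each task is small and safe to run again, so an interruption costs only the work on the task that was in flight.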

3. Incorporating Fault-Tolerance in Your Architecture

Because spot instances and Preemptible VMs can be terminated with very little notice, designing your architecture to be fault-tolerant is the key to their successful integration. A fault-tolerant system can continue to operate even if parts of the system fail.

  • Use Auto Scaling: Auto Scaling is crucial for maintaining availability when using interruptible instances. AWS Auto Scaling or Google Cloud’s managed instance groups allow you to automatically add or replace instances when they are terminated. This ensures that your application continues to function by redistributing workloads to other available instances.
  • Leverage Spot Fleet or Preemptible VM Groups: Both AWS and Google Cloud provide mechanisms for managing collections of spot instances or Preemptible VMs.
    Spot Fleet (AWS): A Spot Fleet is a group of spot instances that can be used to meet specific capacity requirements. AWS automatically handles instance provisioning and replacement when spot instances are terminated.
    Managed Instance Groups (Google Cloud): In Google Cloud, Preemptible VMs can be part of a managed instance group, which automates the process of adding, removing, and balancing Preemptible instances across your application to ensure high availability.

  • Diversify Your Instance Types and Regions: One way to mitigate the risk of losing spot or Preemptible instances is to use multiple instance types and spread your workloads across different availability zones or regions. Spare capacity varies by instance type and location, so distributing your workload increases the chances that some portion of your infrastructure will remain available, even if other parts are interrupted.
    For example, instead of relying solely on t3.medium instances in one AWS region, you can mix different instance families and run your application across multiple regions. This improves resilience and reduces the likelihood of simultaneous interruptions.
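The effect of diversification can be sketched in a few lines of Python. The function below spreads a target capacity evenly over every combination of instance type and zone, then shows how much capacity survives if one pool is reclaimed. The helper names and the specific types and zones are illustrative assumptions; in practice a Spot Fleet or managed instance group performs the allocation for you.

```python
from itertools import cycle, product


def spread_capacity(target: int,
                    instance_types: list[str],
                    zones: list[str]) -> dict[tuple[str, str], int]:
    """Spread `target` instances as evenly as possible over every
    (instance type, zone) pool."""
    pools = list(product(instance_types, zones))
    allocation = {pool: 0 for pool in pools}
    # Hand out one instance at a time, round-robin across the pools.
    for _, pool in zip(range(target), cycle(pools)):
        allocation[pool] += 1
    return allocation


def surviving_capacity(allocation: dict[tuple[str, str], int],
                       lost_pools: set[tuple[str, str]]) -> int:
    """Capacity left if every pool in `lost_pools` is reclaimed at once."""
    return sum(n for pool, n in allocation.items() if pool not in lost_pools)


if __name__ == "__main__":
    # Illustrative values only: 12 instances over 3 types x 2 zones.
    alloc = spread_capacity(
        12,
        ["m5.large", "c5.large", "r5.large"],
        ["us-east-1a", "us-east-1b"],
    )
    print(alloc)
    print(surviving_capacity(alloc, {("m5.large", "us-east-1a")}))
```

With six pools of two instances each, losing any single pool removes only two of the twelve instances, whereas a single-pool deployment could lose all twelve at once.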

4. Graceful Shutdown and Recovery Mechanisms

A critical part of using spot instances and Preemptible VMs is ensuring that your system can gracefully handle interruptions. Since cloud providers give a short warning before terminating instances, your application should be designed to react appropriately.

  • Spot Instance Termination Notice (AWS): AWS issues a two-minute notice before it interrupts a spot instance. The notice is exposed through the instance metadata service and as an EventBridge event, so a script or agent on the instance can watch for it and then gracefully shut down tasks or hand them off to other instances.
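One way to listen for the AWS notice is to poll the instance metadata service from a small agent on the instance. The sketch below assumes the spot/instance-action metadata path, which returns 404 until an interruption is scheduled, and an IMDSv2 session token. The function names, polling interval, and on_notice hook are hypothetical, and the loop takes an injectable fetch function so it can be exercised off-instance.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

METADATA = "http://169.254.169.254/latest"


def fetch_instance_action() -> Optional[str]:
    """Return the raw interruption notice from instance metadata, or None.

    Only works on an EC2 instance. The endpoint answers 404 until an
    interruption has been scheduled.
    """
    token_req = urllib.request.Request(
        f"{METADATA}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=2) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{METADATA}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption scheduled yet
        raise


def wait_for_interruption(fetch: Callable[[], Optional[str]],
                          on_notice: Callable[[dict], None],
                          poll_seconds: float = 5.0,
                          max_polls: Optional[int] = None) -> bool:
    """Poll until a notice appears, then run the shutdown hook once.

    Returns True if a notice was handled, False if `max_polls` ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        body = fetch()
        if body is not None:
            on_notice(json.loads(body))
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

On an instance this could be started as wait_for_interruption(fetch_instance_action, drain_and_exit), where drain_and_exit is whatever your application uses to stop accepting work and flush state within the two-minute window.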

  • Checkpointing and State Management: For tasks that take a long time to process, it’s essential to implement checkpointing. This allows your system to periodically save progress, so when an instance is terminated, the workload can resume from the last checkpoint instead of starting from scratch.
    For example, in machine learning workloads, models are often trained over extended periods. By saving the model’s state periodically, you can resume training on a new instance without losing progress, reducing downtime.
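A checkpointing loop can be sketched as follows. This is a toy job that sums step numbers, not a real training loop, and the function names and JSON state layout are assumptions made for illustration. The local file stands in for durable storage; on a real spot instance the checkpoint would need to live somewhere that survives termination, such as object storage or a network volume.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state atomically so an interruption mid-write cannot
    leave a corrupt checkpoint behind."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)  # atomic rename over the old checkpoint


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh one if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "total": 0}


def run(path: Path, steps: int, every: int = 10,
        stop_at: Optional[int] = None) -> dict:
    """Toy long-running job that checkpoints every `every` steps.

    `stop_at` simulates the instance being reclaimed before that step runs.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if stop_at is not None and step == stop_at:
            return state  # interrupted: work since the last checkpoint is lost
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling run again with the same path after an interruption picks up at the last saved step. The checkpoint interval is a trade-off: shorter intervals lose less work per interruption but spend more time writing state.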

5. Pricing and Budget Considerations

The major draw of spot instances and Preemptible VMs is their lower cost. However, pricing for spot instances can fluctuate based on demand, and it’s essential to monitor these changes to avoid budget overruns.

  • Spot Pricing History (AWS): AWS provides a history of spot pricing, allowing you to analyse trends and choose instance types with the most stable pricing. You can also set a maximum price you are willing to pay; AWS will interrupt your instances if the spot price rises above that limit.
  • Sustained Use Discounts (Google Cloud): For Google Cloud users, even though Preemptible VMs offer substantial savings, sustained use discounts for on-demand instances may offer an alternative for certain workloads that cannot tolerate frequent interruptions. Comparing both options in terms of cost and reliability is critical for determining the best approach.
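As a rough illustration of analysing price history, the sketch below ranks instance types by how much their price moves relative to its average (the coefficient of variation). The function name and the sample prices are made up for the example; real data would come from your provider's pricing history export.

```python
from statistics import mean, pstdev


def price_stability(history: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Rank instance types from steadiest to most volatile.

    Returns (instance type, average price, coefficient of variation),
    where the coefficient of variation is stdev / mean.
    """
    ranked = []
    for instance_type, prices in history.items():
        avg = mean(prices)
        ranked.append((instance_type, avg, pstdev(prices) / avg))
    return sorted(ranked, key=lambda row: row[2])


if __name__ == "__main__":
    # Hypothetical hourly prices in USD, not real market data.
    sample = {
        "type-a": [0.031, 0.030, 0.032, 0.031],
        "type-b": [0.020, 0.045, 0.018, 0.050],
    }
    for name, avg, cv in price_stability(sample):
        print(f"{name}: avg ${avg:.3f}/h, variation {cv:.1%}")
```

A type with a slightly higher average but low variation may be the safer choice for budgeting than a cheaper type whose price swings widely.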

6. Best Practices for Production Workloads

To safely incorporate spot instances and Preemptible VMs into production workloads, follow these best practices:

  • Always have a fallback: Never rely solely on spot or Preemptible instances for critical workloads. Use a mix of on-demand and spot or Preemptible instances to balance cost savings with reliability.
  • Test resilience: Simulate spot or Preemptible interruptions in a testing environment to ensure your system can gracefully handle them before deploying to production.
  • Monitor your workloads: Use cloud provider tools like Amazon CloudWatch or Google Cloud’s monitoring solutions to track the performance of your spot or Preemptible instances and automate responses to interruptions.
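The trade-off behind the fallback rule can be worked through with simple arithmetic. The sketch below compares a mixed fleet against an all on-demand fleet and reports the capacity that remains if every interruptible instance is reclaimed at once. The function name and the prices in the example are hypothetical.

```python
def blended_hourly_cost(total: int,
                        on_demand_base: int,
                        on_demand_price: float,
                        spot_price: float) -> dict[str, float]:
    """Compare a mixed fleet with an all on-demand fleet of the same size."""
    if total <= 0 or not 0 <= on_demand_base <= total:
        raise ValueError("need total > 0 and 0 <= on_demand_base <= total")
    spot_count = total - on_demand_base
    mixed = on_demand_base * on_demand_price + spot_count * spot_price
    all_on_demand = total * on_demand_price
    return {
        "mixed": mixed,
        "all_on_demand": all_on_demand,
        "savings_pct": 100 * (1 - mixed / all_on_demand),
        # Capacity that remains if every spot instance is reclaimed at once.
        "floor_capacity_pct": 100 * on_demand_base / total,
    }


if __name__ == "__main__":
    # Hypothetical prices: $0.10/h on-demand, $0.03/h spot.
    print(blended_hourly_cost(total=20, on_demand_base=6,
                              on_demand_price=0.10, spot_price=0.03))
```

With these example numbers, 6 on-demand and 14 spot instances cost about $1.02 per hour instead of $2.00, roughly a 49% saving, while 30% of capacity is guaranteed to survive a full spot reclamation. Raising the on-demand base buys reliability at the expense of savings.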

Case Study: Cost Optimization with Spot Instances for Video Transcoding

VidStream, a video processing company, needed to reduce escalating cloud costs while maintaining service quality. By incorporating AWS spot instances into their video transcoding pipeline, they achieved a 70% cost reduction. VidStream used a combination of spot and on-demand instances to balance cost and reliability. They implemented checkpointing to save task progress, auto scaling with AWS Spot Fleet to manage capacity, and diversified instance types and regions to minimize interruptions. This approach allowed them to handle interruptions without affecting performance, scaling efficiently during peak demand while controlling costs and maintaining service levels.

Conclusion

Spot instances and Preemptible VMs offer incredible cost savings, but they come with the trade-off of potential interruptions. By strategically incorporating these instances into non-critical, fault-tolerant components of your production environment, you can achieve significant cost reductions without sacrificing performance. Leveraging auto-scaling, instance diversification and fault-tolerant architecture ensures that your workloads can seamlessly recover from interruptions and continue to deliver the necessary performance.

In today’s cloud-first world, adopting these cost-saving strategies not only reduces your operational expenses but also increases your cloud infrastructure’s flexibility and efficiency. Whether you are running batch processing jobs, microservices, or distributed applications, spot instances and Preemptible VMs can help you build a more resilient, cost-effective cloud infrastructure.