Understanding the landscape of AWS compute
Gigi Sayfan
July 10, 2020

In the second part of our "SLOs for AWS-based infrastructure" blog series, Gigi Sayfan surveys the landscape of AWS compute, using Kubernetes as a lens for comparison, and covers in detail how to set SLOs for ECS-, EKS-, Fargate- and Lambda-based services.

Understanding the landscape of AWS compute

AWS had a humble beginning with EC2 instances that you could create and deploy your applications on. These days there are multiple ways to run workloads on AWS:

- Good old EC2 instances (including rolling your own Kubernetes)
- ECS (Elastic Container Service)
- EKS (Elastic Kubernetes Service)
- Fargate - Serverless services
- Lambda - Serverless functions

Each of these options comes with its own pros and cons, as well as its own SLOs and properties that affect the SLOs of your workloads.
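To make that last point concrete, here is a minimal sketch of how the availability of a platform caps the availability of a workload that runs on it. It assumes the dependencies are in series and fail independently, which is a simplification, and the figures are made up for illustration; they are not AWS commitments.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a chain of serial, independent dependencies.

    Each value is a fraction between 0 and 1 (e.g. 0.999 for "three nines").
    """
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability must be in [0, 1], got {a}")
    return prod(availabilities)


def allowed_downtime_minutes(availability, days=30):
    """Error budget, in minutes, over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60


# Hypothetical figures: the compute platform, a load balancer and your own code.
platform, load_balancer, application = 0.9999, 0.9999, 0.999
total = composite_availability([platform, load_balancer, application])
print(f"composite availability: {total:.5f}")
print(f"monthly error budget: {allowed_downtime_minutes(total):.1f} minutes")
```

The composite figure is always lower than the weakest link, so a workload cannot promise more than the platform underneath it delivers.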

Let's start with the traditional route of using plain EC2 instances.

Roll your own applications on AWS EC2

When you roll your own applications, you typically use AWS in IaaS mode. You benefit from Auto Scaling groups, which let you define groups of instances that scale elastically as demand ebbs and flows. Since you are using a small set of AWS services and APIs, you are not affected by the quota limits, or even the outages, of services you don't use. But you still need to be cognizant, at a minimum, of EC2, IAM and networking. There are always subtle points. For example, if you use an ASG (Auto Scaling group) with an attached EBS volume, then when an instance is recycled its EBS volume goes away by default. If you separate the EBS attachment from the ASG, then you can re-attach the existing EBS volume to a new instance.

By and large, you are directly responsible for the management and operation of any infrastructure that you deploy yourself.

In this day and age, this option should be reserved for legacy systems or very unique situations. Containers are all the rage, for a good reason. I highly recommend that for greenfield large-scale enterprise projects you choose one of the container-based solutions below.

One situation where it makes sense to use EC2 is when you want to deploy Kubernetes yourself because you are committed to Kubernetes and EKS doesn't satisfy all your needs.

Designing SLOs for ECS-based services

AWS ECS (Elastic Container Service) is the AWS container orchestration platform. It is comparable to Kubernetes and it is tightly integrated with other AWS resources. I have worked with ECS for a couple of years and it definitely gets the job done. If you're a dedicated AWS shop with no need to run your system on a different cloud provider or test it locally, then ECS is a serious contender for your compute foundation. You get the benefit of containerized applications and ECS will take a lot of hard work off your hands. ECS is integrated with other AWS services like IAM, networking and CloudWatch. ECS allows you to launch your containers (tasks in ECS nomenclature) on either EC2 instances that you provision or the latest and greatest Fargate, where AWS makes sure your containers have somewhere to run. The container images themselves.

Here is a diagram that shows the ECS architecture

You can find the quotas and limits associated with ECS here:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-quotas.htm

As far as SLOs go, you mostly need to worry about your application's SLOs and can rely on ECS to take care of the infrastructure, as its quotas are quite generous. You may need to divide and conquer your system strategically between clusters, services and tasks.

Be aware that you are using a VERY proprietary platform that will be extremely difficult if not impossible to migrate to a different cloud provider or on-prem.

Designing SLOs for EKS-based infrastructure

ECS is great if you accept the AWS lock-in and if you don't have a lot of previous investment in Kubernetes. If you are on the Kubernetes train then ECS is a non-starter. You could always roll your own Kubernetes on EC2, but AWS received a lot of requests from its customers to provide officially managed Kubernetes support.

AWS EKS (Elastic Kubernetes Service) is a fully managed Kubernetes service on AWS. This means that AWS manages the Kubernetes control plane (you still pay for it) and you are only responsible for managing the worker nodes. Recently EKS introduced support for Fargate, which means you don't have to worry about managing any instance or node pool whatsoever.

As always, there are trade-offs, and you pay for convenience with flexibility and control.

We will discuss Fargate in the next section, so let's focus here just on the managed control plane of EKS.

With EKS you get a highly available Kubernetes control plane that runs a certified upstream version of Kubernetes (plain ol' Kubernetes). It is tightly integrated with ECR for pulling images, ELB for load balancing, AWS VPC for network isolation and IAM for authentication. Finally, EKS works very well with AWS App Mesh, which is an Envoy-based service mesh designed by AWS and for AWS. That is a lot you would have to figure out and manage yourself if you rolled your own.

AWS also provides an EKS-optimized AMI for your worker nodes.

You are still responsible for managing your node pools, auto scaling groups and instances. One important thing to pay attention to is Kubernetes version upgrades. EKS can update the version on your master nodes, but you are responsible for upgrading any add-ons as well as the versions of Kubernetes components on your worker nodes. This is not trivial when you have large clusters.

One other "gotcha" on EKS is pod density. The number of pods that can be assigned to a node is typically limited by available CPU and memory. However, on EKS there is another limitation: with the default AWS VPC networking, each pod requires its own IP address on one of the node's network interfaces, and both the number of interfaces and the number of addresses per interface depend on the instance type. The bigger the node, the more network interfaces it has. In practice, if you have a lot of small pods that you want Kubernetes to fit onto a few large nodes, you may be disappointed to discover that your worker nodes are underutilized. You will either have to package your applications into a smaller number of beefy pods or accept the waste of underutilized nodes.
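The pod density limit can be estimated before you pick an instance type. The sketch below assumes the default AWS VPC CNI plugin without prefix delegation, where the commonly documented limit is ENIs × (IPv4 addresses per ENI − 1) + 2. The instance figures in the example are my understanding of the published EC2 limits and should be checked against the current documentation; the resource calculation ignores capacity reserved for the system.

```python
def max_pods_by_network(enis, ipv4_per_eni):
    """Pod limit imposed by the network interfaces of a node.

    One address per ENI is taken by the interface itself; the +2 accounts
    for pods that use host networking.
    """
    return enis * (ipv4_per_eni - 1) + 2


def max_pods_by_resources(node_cpu_m, node_mem_mib, pod_cpu_m, pod_mem_mib):
    """Pod limit imposed by CPU (millicores) and memory (MiB) requests."""
    return min(node_cpu_m // pod_cpu_m, node_mem_mib // pod_mem_mib)


# Assumed limits for an m5.4xlarge: 8 ENIs, 30 IPv4 addresses per ENI,
# 16 vCPUs and 64 GiB of memory. Verify before relying on them.
network_limit = max_pods_by_network(enis=8, ipv4_per_eni=30)
resource_limit = max_pods_by_resources(
    node_cpu_m=16_000, node_mem_mib=64 * 1024, pod_cpu_m=50, pod_mem_mib=128
)

print(f"limit from network interfaces: {network_limit}")
print(f"limit from CPU and memory:     {resource_limit}")
print(f"effective pods per node:       {min(network_limit, resource_limit)}")
```

With small pods like these, the network limit is reached before CPU or memory, which is exactly the underutilization described above.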

The following diagram describes the EKS experience for operators and developers:

As always, the quotas and limits of EKS can influence your overall SLOs. Here are the unadjustable quotas for EKS:

Designing SLOs for Fargate-based services

Fargate is a serverless container solution that works with either ECS or EKS. Fargate has the potential to reduce both your operational overhead (no need to manage servers) as well as your cost (pay for what you use only).

In addition, Fargate is built on top of Firecracker - an open source lightweight VM, implemented in Rust - that offers top-notch performance and strong security boundaries. It sits somewhere between containers and traditional VMs. Notably, the latest Fargate platform version removed Docker completely, although it still depends on containerd.

Here is a comparison of running applications on AWS with and without Fargate:

If you choose to use Fargate, there may be a performance impact on your workloads due to sandboxing, compared to plain containers. You should run stress tests with realistic traffic to ascertain whether it's an issue.

Otherwise, the increased security and the automated management of servers are a big win.

As always, you should also verify that the quotas and limits of Fargate are not a deal breaker for your application.

Those quotas will show up as part of the ECS and EKS quotas. For example, you can only run 100 Fargate tasks per region, per account.

You can have an additional 250 Fargate Spot tasks per region. The Spot tasks use spare capacity, so they are cheaper, but they can be interrupted on very short notice.
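A quick way to see whether these quotas matter is to add up the peak task counts you expect and compare them with the limits. The sketch below uses the two figures quoted above as defaults; the service names and task counts are hypothetical, and the quotas themselves may have changed, so treat them as inputs.

```python
from dataclasses import dataclass

# Figures quoted in the text (per region, per account); pass your own values
# if your account's quotas differ.
FARGATE_ON_DEMAND_QUOTA = 100
FARGATE_SPOT_QUOTA = 250


@dataclass
class ServicePlan:
    name: str
    peak_tasks: int
    spot: bool = False


def quota_headroom(plans,
                   on_demand_quota=FARGATE_ON_DEMAND_QUOTA,
                   spot_quota=FARGATE_SPOT_QUOTA):
    """Remaining tasks under each quota; a negative value means over quota."""
    on_demand = sum(p.peak_tasks for p in plans if not p.spot)
    spot = sum(p.peak_tasks for p in plans if p.spot)
    return {"on_demand": on_demand_quota - on_demand, "spot": spot_quota - spot}


# Hypothetical services and peak task counts.
plans = [
    ServicePlan("api", peak_tasks=60),
    ServicePlan("worker", peak_tasks=50),
    ServicePlan("batch", peak_tasks=200, spot=True),
]
for kind, remaining in quota_headroom(plans).items():
    status = "OK" if remaining >= 0 else "OVER QUOTA"
    print(f"{kind}: headroom {remaining} ({status})")
```

If the headroom goes negative at peak, you need a quota increase or a different split of services across regions or accounts before the SLO is at risk.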

There are various other restrictions associated with Fargate.

Check them out before you commit to Fargate. Migrating away from a Fargate-based architecture can be a major project.

Designing SLOs for Lambda-based services

Last, but not least, let's talk about Lambda functions. Lambda is the "other" serverless technology. Fargate provisions servers for your long-running services. Lambda functions are for ad-hoc or periodic invocation of some code, which can be triggered by an HTTP endpoint, by an SNS notification or on a schedule.

When designing a new service, one of the major SLO decisions is whether the service should be long-running or can be invoked as needed. For example, if the service keeps an in-memory cache or handles traffic non-stop, then a long-running service makes more sense. But if a service is called infrequently, then basing the service on Lambda functions might be better. Of course, you can also mix and match and have a long-running service whose infrequently called endpoints invoke Lambda functions. However, since you already have a long-running service, you may as well implement the infrequent endpoints there too, to keep things uniform and in one place.

A common practice these days is to have a truly serverless service by utilizing Lambda functions with S3/DynamoDB as the persistent storage, together with SQS/SNS for pub-sub workflows. You delegate 100% of the infrastructure to AWS and focus solely on the functionality of your service.

Lambda functions can also be triggered by CloudWatch alarms and by Kinesis, S3 and SQS events.

I like to think of Lambda functions as the glue that connects a variety of AWS services with custom logic.
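As an illustration of the glue role, here is a minimal sketch of a Python handler that reacts to S3 event notifications. The event layout (`Records`, `s3.bucket.name`, `s3.object.key`) follows the S3 notification format as I understand it; the processing step is a placeholder, and nothing here talks to SNS or DynamoDB.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue  # not an S3 record, ignore it
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded, with spaces encoded as '+'.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def handler(event, context):
    """Lambda entry point; `context` is unused in this sketch."""
    processed = []
    for bucket, key in extract_s3_objects(event):
        # Placeholder for the custom logic: parse the file, publish a
        # notification, write a row to a table, and so on.
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}
```

Keeping the event parsing in a plain function makes it easy to test locally with a hand-written event, without deploying anything.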

Here is an architecture of a simple Lambda-based system for file processing that utilizes S3, SNS and DynamoDB:

Lambda sounds exciting and it is, but there is no such thing as a free lunch. Let's look at some of the limitations as well as long-term management considerations.

  • Lambda functions suffer from the infamous cold start problem. When your function is invoked for the first time AWS needs to find a place to run it and copy the code to the execution site.
  • Lambda functions are also restricted to 15 minutes of runtime (it used to be 5 minutes). If you want to run a long computation you'll have to break it up into multiple parts.
  • There are restrictions on /tmp storage, on the payload size for requests and more. You can check out the full list here: Lambda limits.
  • There are also specific limitations when integrating Lambda functions with other services.
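One way to live with the runtime limit is to make each invocation process as much as it can and hand back a cursor for the next one. The sketch below is one possible shape for that, not a prescribed pattern: the event fields `items` and `cursor` are hypothetical, the work function is a placeholder, and how the next invocation gets triggered (a queue, a workflow service, a scheduler) is left out. It relies on the `get_remaining_time_in_millis()` method of the Lambda context object.

```python
def process_until_deadline(items, start, work, remaining_ms, safety_margin_ms=10_000):
    """Process items[start:] until the time budget is nearly used up.

    `remaining_ms` is a callable returning the milliseconds left.
    Returns the index of the first unprocessed item (len(items) when done).
    """
    index = start
    while index < len(items) and remaining_ms() > safety_margin_ms:
        work(items[index])
        index += 1
    return index


def handler(event, context):
    """Lambda entry point that resumes from the cursor in the event."""
    items = event["items"]          # hypothetical event field
    start = event.get("cursor", 0)  # hypothetical event field
    next_cursor = process_until_deadline(
        items,
        start,
        work=lambda item: None,  # placeholder for the real computation
        remaining_ms=context.get_remaining_time_in_millis,
    )
    return {"cursor": next_cursor, "done": next_cursor >= len(items)}
```

The safety margin leaves time to return the cursor cleanly; size it to the slowest single item you expect.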

You should also be watchful of AWS deprecating specific runtime versions you depend on. At a previous company we had a Lambda function implemented using Node.js 8, and we had to scramble to upgrade it at a very inconvenient time when AWS pulled the plug on it.
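A small periodic check can turn that scramble into a planned upgrade. The sketch below assumes you already have an inventory of functions shaped like the entries of the Lambda `ListFunctions` response (`FunctionName` and `Runtime` fields); fetching it is out of scope here. The deprecation set is a hand-maintained assumption, seeded with the Node.js 8 runtime mentioned above, and the function names are made up.

```python
# Hand-maintained set of runtime identifiers you consider end-of-life.
# Keep it in sync with AWS's runtime deprecation announcements.
DEPRECATED_RUNTIMES = {"nodejs8.10"}


def functions_on_deprecated_runtimes(functions, deprecated=DEPRECATED_RUNTIMES):
    """Return the sorted names of functions whose runtime is deprecated."""
    return sorted(
        f["FunctionName"] for f in functions if f.get("Runtime") in deprecated
    )


# Hypothetical inventory for illustration.
inventory = [
    {"FunctionName": "thumbnailer", "Runtime": "nodejs8.10"},
    {"FunctionName": "report-builder", "Runtime": "python3.8"},
    {"FunctionName": "custom-image"},  # entry without a Runtime field
]
for name in functions_on_deprecated_runtimes(inventory):
    print(f"upgrade needed: {name}")
```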

Summary

AWS provides a huge selection of options for running workloads. There is a lot of value in the various offerings and you can definitely find the proper solution, or set of solutions, for your use case. However, when designing SLOs, and especially when considering how your system is going to scale, you must be aware of the fine print of every offering. The more sophisticated solutions tend to have more strings attached. When utilizing cloud resources at scale, cost is a primary concern. Advanced solutions like Fargate and Lambda functions can potentially save you a lot of money, but if used without deep understanding they can actually lead to runaway spending on unneeded resources. Quotas are another major headache: you have to monitor them and increase them as necessary, and sometimes they can even force you to change your architecture in major ways, like switching to a multi-account strategy to avoid account limits.

If you are committed to AWS, study the various options and stay vigilant as existing solutions are improved, new solutions are added, and limits and cost structures change.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.
