
Understanding the landscape of AWS compute

Jul 10, 2020
Last Updated: Jul 10, 2020

In the second part of our "SLOs for AWS-based infrastructure" blog series, Gigi Sayfan dives deeper into the landscape of AWS compute, using Kubernetes as a lens to compare and contrast the options, and covers in detail how to set SLOs for ECS, EKS, Fargate, and Lambda based services.

    Understanding the landscape of AWS compute

    AWS had a humble beginning with EC2 instances on which you could deploy your applications. These days there are multiple ways to run workloads on AWS:

    - Good old EC2 instances (including rolling your own Kubernetes)
    - ECS (Elastic Container Service)
    - EKS (Elastic Kubernetes Service)
    - Fargate - serverless services
    - Lambda - serverless functions

    Each of these options comes with its own pros and cons, as well as its own SLOs and properties that impact the SLOs of your workloads.

    Let's start with the traditional route of using plain EC2 instances.

    Roll your own applications on AWS EC2

    When you roll your own applications you typically use AWS in IaaS mode. You benefit from Auto Scaling Groups (ASGs), which let you define groups of instances that scale elastically as demand ebbs and flows. Since you use only a small set of AWS services and APIs, you are not impacted by quota limits or even outages of the services you don't use. But you still need to be cognizant, at a minimum, of EC2, IAM and networking. There are always subtle points. For example, if an EBS volume is attached through the ASG's launch configuration, then when an instance is recycled its EBS volume goes away by default. If you manage EBS attachment separately from the ASG, you have to re-attach the existing EBS volume to each new instance.
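    For illustration, here is a minimal, hedged boto3 sketch of that re-attachment step; the volume and instance IDs are hypothetical placeholders, and in a real system this would run from a lifecycle hook or a small controller rather than by hand.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"      # pre-existing data volume (hypothetical)
NEW_INSTANCE_ID = "i-0123456789abcdef0"  # replacement instance launched by the ASG (hypothetical)

# Wait until the volume is free (detached from the terminated instance)...
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# ...then attach it to the new instance under the same device name.
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_INSTANCE_ID, Device="/dev/xvdf")
```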

    By and large, you are directly responsible for the management and operation of any infrastructure that you deploy yourself.

    In this day and age, this option should be reserved for legacy systems or very unique situations. Containers are all the rage, for good reason. For greenfield large-scale enterprise projects, I highly recommend choosing one of the container-based solutions below.

    One situation where it does make sense to use EC2 is if you want to deploy Kubernetes yourself because you are committed to Kubernetes and EKS doesn't satisfy all your needs.

    Designing SLOs for ECS-based services

    AWS ECS (Elastic Container Service) is the AWS container orchestration platform. It is comparable to Kubernetes and tightly integrated with other AWS resources. I have worked with ECS for a couple of years and it definitely gets the job done. If you're a dedicated AWS shop with no need to run your system on a different cloud provider or test it locally, then ECS is a serious contender for your compute foundation. You get the benefits of containerized applications, and ECS takes a lot of hard work off your hands. ECS is integrated with the other AWS services like IAM, networking and CloudWatch. ECS lets you launch your containers (tasks in ECS nomenclature) either on EC2 instances that you provision or on the latest and greatest, Fargate, where AWS makes sure your containers have somewhere to run. The container images themselves are typically stored in ECR (Elastic Container Registry).
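    For a flavor of the ECS nomenclature, here is a hedged boto3 sketch (not from the article) that registers a task definition and runs it as a service on EC2 capacity; the image URI, cluster and service names are hypothetical.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# A task definition is ECS's unit of deployment: one or more containers plus
# their resource requirements.
ecs.register_task_definition(
    family="web-api",
    networkMode="bridge",
    requiresCompatibilities=["EC2"],
    containerDefinitions=[{
        "name": "web",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-api:latest",  # hypothetical ECR image
        "memory": 512,
        "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        "essential": True,
    }],
)

# A service keeps the desired number of task copies running on the cluster.
ecs.create_service(
    cluster="my-cluster",        # hypothetical, pre-existing cluster
    serviceName="web-api",
    taskDefinition="web-api",
    desiredCount=2,
    launchType="EC2",
)
```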

    Here is a diagram that shows the ECS architecture:

    You can find the quotas and limits associated with ECS here:

    https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-quotas.html

    As far as SLOs go, you mostly need to worry about your application's SLOs and can rely on ECS to take care of the infrastructure, as its quotas are quite generous. You may still need to divide your system strategically between clusters, services and tasks.
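    You can also keep an eye on those quotas programmatically. The following is a minimal sketch using the Service Quotas API; the region is an example and the output is printed as-is.

```python
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# List the ECS quotas that apply to this account in this region, and whether
# each one can be raised via a quota increase request.
resp = quotas.list_service_quotas(ServiceCode="ecs")
for q in resp["Quotas"]:
    print(f'{q["QuotaName"]}: {q["Value"]} (adjustable: {q["Adjustable"]})')
```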

    Be aware that you are using a VERY proprietary platform that will be extremely difficult if not impossible to migrate to a different cloud provider or on-prem.

    Designing SLOs for EKS-based infrastructure

    ECS is great if you accept the AWS lock-in and if you don't have a lot of previous investment in Kubernetes. If you are on the Kubernetes train then ECS is a non-starter. You could always roll your own Kubernetes on EC2, but AWS received a lot of requests from its customers to provide officially managed Kubernetes support.

    AWS EKS (Elastic Kubernetes Service) is a fully managed Kubernetes service on AWS. This means that AWS manages the Kubernetes control plane (you still pay for it) and you are only responsible for managing the worker nodes. Recently, EKS introduced support for Fargate, which means you don't have to worry about managing any instances or node pools whatsoever.
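    For illustration, standing up the managed control plane and a managed node group comes down to a couple of API calls. A hedged boto3 sketch follows; the cluster name, IAM role ARNs, subnets and security group are placeholders.

```python
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# AWS runs the control plane for this cluster; you only describe how it
# should be wired into your VPC.
eks.create_cluster(
    name="prod",
    version="1.16",
    roleArn="arn:aws:iam::123456789012:role/eks-cluster-role",
    resourcesVpcConfig={
        "subnetIds": ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)
eks.get_waiter("cluster_active").wait(name="prod")

# The worker nodes remain your responsibility; a managed node group helps,
# but sizing, scaling and upgrades are still on you.
eks.create_nodegroup(
    clusterName="prod",
    nodegroupName="general",
    nodeRole="arn:aws:iam::123456789012:role/eks-node-role",
    subnets=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    instanceTypes=["m5.large"],
    scalingConfig={"minSize": 2, "maxSize": 10, "desiredSize": 3},
)
```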

    As always, there are trade-offs: you pay in flexibility and control for convenience.

    We will discuss Fargate in the next section, so let's focus here just on the managed control plane of EKS.

    With EKS you get a highly available Kubernetes control plane that runs a certified upstream version of Kubernetes (plain ol' Kubernetes). It is tightly integrated with ECR for pulling images, ELB for load balancing, AWS VPC for network isolation and IAM for authentication. Finally, EKS works very well with AWS App Mesh, which is an Envoy-based service mesh designed by AWS for AWS. That is a lot that you would otherwise have to figure out and manage yourself if you rolled your own.

    AWS also provides an EKS optimized AMI for your worker nodes.

    You are still responsible for managing your node pools, auto scaling groups and instances. One important thing to pay attention to is Kubernetes version upgrades. EKS can update the version on your master nodes, but you are responsible for upgrading any add-ons as well as the versions of Kubernetes components on your worker nodes. This is not trivial when you have large clusters.
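    A small, hedged boto3 sketch for spotting version skew between the managed control plane and your managed node groups; the cluster name is hypothetical.

```python
import boto3

eks = boto3.client("eks", region_name="us-east-1")
CLUSTER = "prod"  # hypothetical cluster name

control_plane_version = eks.describe_cluster(name=CLUSTER)["cluster"]["version"]
print(f"Control plane: {control_plane_version}")

# Worker upgrades are your job: flag node groups that lag behind the control plane.
for ng_name in eks.list_nodegroups(clusterName=CLUSTER)["nodegroups"]:
    ng = eks.describe_nodegroup(clusterName=CLUSTER, nodegroupName=ng_name)["nodegroup"]
    if ng["version"] != control_plane_version:
        print(f"Node group {ng_name} is on {ng['version']} -- upgrade needed")
```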

    One other "gotcha" on EKS is pod density. The number of pods that can be assigned to any node is typically limited by available CPU and memory. However, on EKS there is another limitation, which is the available number of network interfaces, since each pod requires its own network interface. The bigger the node, the more network interfaces it has. In practice it means that if you have a lot of small pods you want Kubernetes to fit into a few large nodes you will be disappointed and discover that your worker nodes are underutilized. You will either have to package your applications into a smaller number of beefy pods or accept the waste of underutilized nodes.

    The following diagram describes the EKS experience for operators and developers:

    As always, the quotas and limits of EKS can influence your overall SLOs. Here are the non-adjustable quotas for EKS:

    Designing SLOs for Fargate-based services

    Fargate is a serverless container solution that works with either ECS or EKS. Fargate has the potential to reduce both your operational overhead (no need to manage servers) and your cost (you pay only for what you use).

    In addition, Fargate is built on top of Firecracker - an open source, lightweight virtualization technology implemented in Rust - that offers top-notch performance and strong security boundaries. It sits somewhere between containers and traditional VMs. Notably, the latest Fargate platform version removed Docker completely, although it still depends on containerd.

    Here is a comparison of running applications on AWS with and without Fargate:

    If you choose to use Fargate there may be a performance impact on your workloads, due to the sandboxing, compared to plain containers. You should run stress tests with realistic traffic to ascertain whether it's an issue or not.

    Otherwise, the increased security and the automated management of servers is a big win.
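    For illustration, launching a container on Fargate boils down to a single run_task call, assuming a task definition already registered with FARGATE compatibility and awsvpc network mode; the cluster, subnet and security group IDs below are placeholders.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# No instances to provision: AWS finds somewhere to run the task, but Fargate
# tasks must use awsvpc networking, so you supply subnets and security groups.
ecs.run_task(
    cluster="my-cluster",
    taskDefinition="web-api-fargate",   # hypothetical Fargate-compatible task definition
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```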

    As always, you should also verify that the quotas and limits of Fargate are not a deal breaker for your application.

    Those quotas show up as part of the ECS and EKS quotas. For example, you can only run 100 Fargate tasks per region, per account.

    You can run an additional 250 Fargate Spot tasks per region. Spot tasks run on spare AWS capacity, so they are cheaper, but they can be interrupted without warning.
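    A hedged sketch of mixing regular and Spot capacity for a Fargate service follows; it assumes the cluster has the FARGATE and FARGATE_SPOT capacity providers enabled, and all names and IDs are hypothetical.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Keep a small on-demand baseline and put the rest of the tasks on cheaper
# Fargate Spot capacity that may be reclaimed without warning.
ecs.create_service(
    cluster="my-cluster",
    serviceName="batch-workers",
    taskDefinition="batch-worker",   # hypothetical Fargate-compatible task definition
    desiredCount=10,
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
        }
    },
)
```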

    There are various other restrictions associated with Fargate.

    Check them out before you commit to Fargate. Migrating away from a Fargate-based architecture can be a major project.

    Designing SLOs for Lambda-based services

    Last, but not least, let's talk about Lambda functions. Lambda is the "other" serverless technology. Fargate provisions servers for your long-running services; Lambda functions are for ad-hoc or periodic invocations of code that can be triggered by an HTTP endpoint, an SNS notification or a schedule.

    When designing a new service, one of the major SLO decisions is whether the service should be long-running or invoked on demand. For example, if the service keeps an in-memory cache or handles traffic non-stop, then a long-running service makes more sense. But if a service is called infrequently, then basing it on Lambda functions might be better. Of course, you can also mix and match and have a long-running service whose infrequently called endpoints invoke Lambda functions. However, since you already have a long-running service, you may as well implement the infrequent endpoints there too, to keep things uniform and in one place.

    A common practice these days is to build a truly serverless service by combining Lambda functions with S3/DynamoDB for persistent storage and SQS/SNS for pub-sub workflows. You delegate 100% of the infrastructure to AWS and focus solely on the functionality of your service.
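    The wiring itself is also just API calls. As one hedged example, this sketch connects an SQS queue to a Lambda function so AWS polls the queue and invokes the function for you; the queue ARN and function name are hypothetical.

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# AWS polls the queue and invokes the function with batches of messages;
# there are no servers for you to run or scale.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders",  # hypothetical queue
    FunctionName="process-order",                                # hypothetical function
    BatchSize=10,
)
```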

    Lambda functions can also be triggered by CloudWatch alarms and scheduled events, Kinesis streams, S3 events and SQS queues.

    I like to think of Lambda functions as the glue that connects a variety of AWS services with custom logic.

    Here is an architecture of a simple Lambda-based system for file processing that utilizes S3, SNS and DynamoDB:
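    The diagram aside, a minimal sketch of the Lambda handler at the heart of such a pipeline might look like this; the table name and topic ARN are hypothetical, and error handling is omitted for brevity.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

TABLE_NAME = "processed-files"                              # hypothetical DynamoDB table
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:file-done"  # hypothetical SNS topic

def handler(event, context):
    """Triggered by S3 object-created events: record file metadata in
    DynamoDB and notify downstream consumers via SNS."""
    table = dynamodb.Table(TABLE_NAME)
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]

        table.put_item(Item={"key": key, "bucket": bucket, "size": size})
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"bucket": bucket, "key": key}))
    return {"processed": len(event["Records"])}
```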

    Lambda sounds exciting, and it is, but there is no such thing as a free lunch. Let's look at some of the limitations as well as the long-term management considerations.

    • Lambda functions suffer from the infamous cold start problem. When your function is invoked for the first time (or after a period of inactivity), AWS needs to find a place to run it and copy the code to the execution environment, which adds latency (one common mitigation is shown in the sketch after this list).
    • Lambda functions are also restricted to 15 minutes of runtime (it used to be 5 minutes). If you want to run a long computation you'll have to break it up into multiple parts.
    • There are restrictions on /tmp storage, on the payload size for requests and more. You can check out the full list here: Lambda limits.
    • There are also specific limitations when integrating Lambda functions with other services.
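    As mentioned in the first bullet above, one common mitigation for cold starts is provisioned concurrency, which keeps a fixed number of execution environments warm for a published version or alias in exchange for extra cost. A hedged sketch; the function name and alias are hypothetical.

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Keep five execution environments warm for the "live" alias so invocations
# don't pay the cold start penalty, trading extra cost for predictable latency.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="process-order",   # hypothetical function
    Qualifier="live",               # alias or published version (required)
    ProvisionedConcurrentExecutions=5,
)
```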

    You should also be watchful of AWS deprecating specific runtime versions you depend on. At a previous company we had a Lambda function implemented with Node.js 8, and we had to scramble to upgrade it at a very inconvenient time when AWS pulled the plug on that runtime.
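    To avoid being surprised, you can periodically audit your account for functions running on runtimes that AWS has scheduled for deprecation. A minimal sketch follows; the deprecated-runtime list is illustrative and should be kept in sync with the AWS runtime support policy.

```python
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Example list of runtime identifiers AWS has deprecated; keep it up to date
# with the official Lambda runtime support policy.
DEPRECATED = {"nodejs4.3", "nodejs6.10", "nodejs8.10", "dotnetcore1.0", "dotnetcore2.0"}

# Walk every function in the region and flag the ones that need an upgrade.
for page in lambda_client.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        if fn.get("Runtime") in DEPRECATED:
            print(f'{fn["FunctionName"]} uses deprecated runtime {fn["Runtime"]}')
```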

    Summary

    AWS provides a huge selection of options for running workloads. There is a lot of value in the various offerings and you can definitely find the proper solution or set of solutions for your use case. However, when designing SLOs, and especially when considering how your system is going to scale, you must be aware of the fine print of every offering. The more sophisticated solutions tend to have more strings attached. When utilizing cloud resources at scale, cost is a primary concern. Advanced solutions like Fargate and Lambda functions can potentially save you a lot of money, but used without deep understanding they can actually lead to runaway spending on unneeded resources. Quotas are another major headache: you have to monitor them, increase them as necessary, and sometimes they can even force you to change your architecture in major ways, like switching to a multi-account strategy to avoid account limits.

    If you are committed to AWS, study the various options and stay vigilant as existing solutions improve, new solutions are added, and limits and cost structures change.

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.
