In our latest two-part blog series, Gigi Sayfan, author of “Mastering Kubernetes”, discusses managing complex infrastructure on AWS with an eye towards SLOs (service level objectives). There are many ways to discuss infrastructure management. In this first part, he covers SLOs for AWS, observability on AWS, quotas and limits, and optimizing cost on AWS. In the second part, he uses the lens of Kubernetes to compare and contrast compute infrastructure on AWS with Kubernetes.
In this article we will discuss managing complex infrastructure on AWS with an eye towards SLOs (Service Level Objectives). The focus will be on compute infrastructure and we will leave storage and networking for another day. There are many ways to discuss management of infrastructure. We will use the lens of Kubernetes to compare and contrast compute infrastructure on AWS with Kubernetes.
There are two primary ways to use the cloud:
1. Use it as IaaS (Infrastructure as a Service)
2. Go all-in on the cloud provider's services
With the IaaS approach you try to use the bare minimum of cloud services. On AWS, this means EC2, S3, IAM and the networking services. Everything else you deploy and manage yourself.
With the All-in approach you commit to your cloud provider and use each and every service under the sun - data stores, security, CI/CD, you name it.
The IaaS approach buys you some flexibility. Often, when you already have legacy infrastructure running on-prem it is the best way. You are less locked-in to your cloud provider and it is easier to test your system locally. It comes with the responsibility to install, maintain, monitor, upgrade and patch most of your stack. This is a big deal.
The All-in approach buys you a lot of confidence that you are on the right path. You get to benefit from hundreds of years of experience and management by your cloud provider. With a click of a button or CLI command you can deploy, scale and observe a plethora of high-end services and the list keeps growing.
The All-in approach seems like a no-brainer initially. When you actually apply this strategy at scale you discover that there is a price for all this goodness. The price is literally price! The cloud is expensive.
In addition, large cloud infrastructure is complicated and you need an SRE/DevOps team with a lot of cloud expertise to benefit from it and/or pay even more for professional support.
In practice, there is almost always some hybrid unless you are very disciplined. For example, even if you choose the IaaS approach it may be too tempting to just launch some service like RDS temporarily and get a managed database. It may be just for some prototyping with the best intentions, but you often end up supporting this temporary solution for a long time, sometimes forever. Then, it can become a slippery slope.
The other extreme is not always easy to maintain either. You want to use only AWS services, but someone just installs some open source project and pretty soon it becomes successful and you have to support it now.
When running traditional infrastructure you typically care about the common SLIs (Service Level Indicators):
- error rate
Check out my previous article Using observability tools to set SLOs for Kubernetes Applications for in-depth discussion of SLOs in general and on Kubernetes in particular.
On AWS you still care about them, but the picture is much more complicated now. There are many pieces in place that you don't have to build, but just decide if you're going to use them or not.
The SLO of your application is built on top of the SLOs of the AWS services you use. Those SLOs can be difficult to ascertain because there are many ways to compose them.
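As a rough illustration of how service SLOs compose, here is a minimal sketch. It assumes that failures are independent, which real systems only approximate, and the availability figures are hypothetical placeholders rather than published AWS numbers. Dependencies that must all be up multiply together, while redundant copies fail only when every copy fails.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every dependency to be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability when any one of several independent replicas is enough."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures for illustration only, not published AWS numbers.
dependencies = {"compute": 0.9995, "database": 0.9995, "blob_storage": 0.999}

single_region = serial_availability(dependencies.values())
two_regions = redundant_availability([single_region, single_region])

print(f"single region: {single_region:.4%}")
print(f"two regions:   {two_regions:.6%}")
```

The point of the sketch is that a chain of dependencies is always less available than its weakest link, so the SLO you commit to has to be looser than the SLOs of the services underneath it unless you add redundancy.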
For example, consider plain old blob storage like S3. There are various ways to utilize it:
- S3 Standard
- S3 Intelligent-Tiering
- S3 Standard-Infrequent Access
- S3 One Zone-Infrequent Access
- S3 Glacier Deep Archive
Each one of these options comes with its own SLOs and tradeoffs and there are easy ways to migrate your data from one to another and/or store hot/cold fractions of your data at different levels.
Now, consider that the S3 blob storage is just a part of the AWS data storage, access and transfer story. There are also EBS, EFS, FSx, ElastiCache, RDS, RDS Aurora, DynamoDB, DocumentDB, Neptune, Redshift, QLDB, Keyspaces, etc. Don't even get me started on the variety of services to transfer data between those services as well as external data sources.
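One way to move data between storage classes is an S3 lifecycle rule. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The prefix, the day thresholds and the bucket name are hypothetical. The rule builder is plain Python, and the call that actually touches AWS is left commented out.

```python
def build_lifecycle_rule(prefix, days_to_ia=30, days_to_deep_archive=180):
    """Build one S3 lifecycle rule that moves ageing objects to colder classes."""
    if days_to_deep_archive <= days_to_ia:
        raise ValueError("deep archive transition must come after the IA transition")
    return {
        "ID": f"tier-down-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
            {"Days": days_to_deep_archive, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


def apply_lifecycle(bucket, rules):
    """Apply the rules to a bucket. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the rule builder works without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


rule = build_lifecycle_rule("logs/")
print(rule)
# apply_lifecycle("example-bucket", [rule])  # hypothetical bucket name
```

Note that applying a lifecycle configuration replaces whatever configuration the bucket already has, so in practice you would merge new rules with the existing ones first.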
Check out this link for a one sentence explanation of each AWS service:
The paradox of choice is very real with AWS.
Luckily (or by design) AWS has strong observability capabilities.
To provide a service level, you must have a proper monitoring and observability posture. One of the greatest benefits of committing to the AWS way is that observability is deeply integrated with all AWS services. AWS CloudWatch is the gateway to AWS observability.
At its core CloudWatch is a metrics repository. But now under the CloudWatch umbrella AWS centralizes all your observability needs including:
- Logs collection and analysis
- Metrics from AWS services and custom metrics
- Container Insights
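To make the custom metrics item above concrete, here is a minimal sketch of publishing a latency data point with boto3. The metric name, the dimension and the namespace are hypothetical choices for illustration. The data point builder is plain Python, and the call that needs AWS credentials is left commented out.

```python
from datetime import datetime, timezone


def latency_datum(service, latency_ms, when=None):
    """Build one MetricData entry in the shape CloudWatch's PutMetricData expects."""
    return {
        "MetricName": "RequestLatency",  # hypothetical metric name
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Timestamp": when or datetime.now(timezone.utc),
        "Value": float(latency_ms),
        "Unit": "Milliseconds",
    }


def publish(namespace, data):
    """Send the data points to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


datum = latency_datum("checkout", 182)
print(datum["MetricName"], datum["Value"], datum["Unit"])
# publish("MyApp/SLI", [datum])  # hypothetical namespace
```

Once a metric like this is in CloudWatch, it can back an alarm or a dashboard in the same way as the metrics that AWS services emit on their own.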
There is a whole other set of services for security and audit purposes. We will not get into it in this article.
However, even with CloudWatch the broad scope and complexity of AWS services are not easy to tame. In addition, there are other pitfalls to be aware of.
SLOs are about availability and performance. One of the unique challenges when using AWS is that each service and API comes with its own set of quotas and limits. If you are unaware, you will run into those limits and quotas at the worst time.
To ensure the availability of your applications and comply with your SLOs you must keep track of the quotas and limits of each AWS service that you use. This requires discipline and can be frustrating. A key element of the cloud is its elasticity and infinite capacity, but when you read the fine print and learn about those quotas and limits you realize that capacity planning is not a thing of the past. It just takes a different perspective.
Here is a look at the AWS quotas console
There are many quotas for each service. For example, 68 different quotas just for EC2!
Some of the quotas can be adjusted and some are fixed.
Here are the 15 quotas associated with the AWS Lambda service
We need a plan to deal with quotas if we want to keep our system up and running. The alternative is to be surprised when you unknowingly bump against a quota. Here are some reasonable steps:
1. Understand the quotas associated with each AWS service that you use
2. Identify quotas that are relevant for your use case
3. Set up alarms to warn you when you get close to one of the quotas
4. Adjust quotas if possible (requires support request and can take a few days)
5. Design around the problem if it's not possible to adjust the quota.
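The steps above can be sketched as a small headroom check. This is only an illustration under a few assumptions: the quota records follow the shape returned by the Service Quotas API (`QuotaName`, `Value`, `Adjustable`), current usage is collected separately by the caller, and the 80% warning threshold and the sample data are hypothetical.

```python
def quota_report(quotas, usage, warn_ratio=0.8):
    """Return quotas whose usage is at or above warn_ratio of the limit.

    quotas: list of dicts with QuotaName, Value, Adjustable (Service Quotas shape)
    usage:  dict mapping QuotaName to current usage, collected by the caller
    """
    findings = []
    for quota in quotas:
        name, limit = quota["QuotaName"], quota["Value"]
        used = usage.get(name)
        if used is None or limit <= 0:
            continue  # not relevant for this workload, or no meaningful limit
        ratio = used / limit
        if ratio >= warn_ratio:
            action = "request increase" if quota["Adjustable"] else "design around it"
            findings.append((name, round(ratio, 2), action))
    return findings


def fetch_quotas(service_code):
    """List quotas for one service. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the report logic works without it

    paginator = boto3.client("service-quotas").get_paginator("list_service_quotas")
    return [q for page in paginator.paginate(ServiceCode=service_code) for q in page["Quotas"]]


# Hypothetical sample data standing in for fetch_quotas("ec2") and real usage.
sample_quotas = [
    {"QuotaName": "quota-a", "Value": 100.0, "Adjustable": True},
    {"QuotaName": "quota-b", "Value": 5.0, "Adjustable": False},
    {"QuotaName": "quota-c", "Value": 50.0, "Adjustable": True},
]
sample_usage = {"quota-a": 85, "quota-b": 5, "quota-c": 10}

findings = quota_report(sample_quotas, sample_usage)
for finding in findings:
    print(finding)
```

Quotas that do not appear in the usage dictionary are skipped, which corresponds to step 2: you only track the quotas that matter for your use case. In a real setup the same logic would feed an alarm rather than a print statement.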
At scale some quotas might force you to make major changes to your architecture. For example, you can have only 5 transit gateway attachments from your Virtual Private Cloud (VPC). This quota is not adjustable. If you need more, you will have to use multiple VPCs.
API rate limits are another obstacle you might run into. They can force you to call certain AWS APIs less frequently, which either reduces the fidelity of your applications or pushes you to find creative solutions.
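A common way to live with rate limits is to retry throttled calls with capped exponential backoff and jitter. The sketch below is generic Python, not an AWS API: `ThrottledError`, `call_with_backoff` and the fake `flaky_api` are hypothetical names used only for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a provider throttling error (hypothetical)."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=20.0, sleep=time.sleep):
    """Retry func on ThrottledError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Demo: a fake API that is throttled twice before succeeding.
calls = {"count": 0}


def flaky_api():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottledError("rate exceeded")
    return "ok"


delays = []
result = call_with_backoff(flaky_api, sleep=delays.append)  # record delays instead of sleeping
print(result, calls["count"], len(delays))
```

The AWS SDKs already retry throttled requests on their own, so a wrapper like this is mostly useful around your own batching logic or when you want explicit control over how long a caller is allowed to wait.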
Often quotas and limits are per AWS account. If you reach a scale that requires more resources than allowed your only option might be to switch to a multi-account architecture. This is far from trivial. I have done it twice in my career. The first time was switching from a single account to multiple accounts. The second time was building a multi-account architecture from scratch.
When you start to use AWS services in abundance you realize that it's super easy to provision resources either manually using the console or via command-line, SDKs and APIs. But those resources aren't cheap. There is literally a price to pay. At a certain point, the cost of your AWS infrastructure will start to play a major role in your design decisions, in the processes you employ and as a result also in the SLOs that you commit to. How much redundancy do you need? How much hot data do you keep? How often do you refresh your dashboards?
Eventually, cost may become its own Service Level Objective (SLO).
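If cost is treated as an SLO, it needs an indicator you can check continuously. One simple option, sketched here with hypothetical numbers, is a linear projection of month-end spend compared against a monthly budget. Real spend is rarely linear, so treat this as a first approximation.

```python
def projected_month_spend(spend_to_date, day_of_month, days_in_month):
    """Linear projection of month-end spend from the spend so far."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month out of range")
    return spend_to_date / day_of_month * days_in_month


def cost_slo_status(spend_to_date, day_of_month, days_in_month, monthly_budget):
    """Compare the projected spend with the budget that acts as the cost SLO."""
    projected = projected_month_spend(spend_to_date, day_of_month, days_in_month)
    return {"projected": round(projected, 2), "within_budget": projected <= monthly_budget}


# Hypothetical figures: $4,200 spent by day 12 of a 30-day month, $10,000 budget.
status = cost_slo_status(
    spend_to_date=4200.0, day_of_month=12, days_in_month=30, monthly_budget=10000.0
)
print(status)
```

With these sample numbers the projection lands above the budget, which is the kind of early signal that lets you react before the bill arrives.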
These tools can help meet your cost SLOs and not break the bank.
Optimizing cost on AWS is a big and never-ending task. You have to be vigilant and track discounts, long-term commitment pricing and changes to the basic pricing of various services.
In the second part of this blog, we will use the lens of Kubernetes to compare and contrast compute infrastructure on AWS with Kubernetes and cover in detail setting of SLOs for ECS, EKS, Fargate, and Lambda based services.