Releasing new features in SaaS deployments involves many moving parts. Rolling out new features in a live SaaS environment is subject to concerns like downtime, costs, and ease of rollbacks.
The canary deployment strategy employs the gradual rollout of new versions of the software. We can detect unforeseen errors by starting small and testing a small percentage of traffic on the latest version while the older stable version serves most of the traffic. This avoids widespread issues created by the new version. The older nodes are gradually replaced with new versions when confidence in the release is sufficient.
The origin of the term “canary” in this context comes from the mining industry. Coal miners used to send canary birds into mines to detect toxic gasses. The miners determined whether the mine was safe based on the canary's reaction. If the bird died or was negatively affected, the mine was unsafe. If the bird remained healthy, the miners were good to proceed with work. This is precisely the logic we use in canary deployments.
In general, the strategies discussed here may apply to multiple deployment scenarios. In this post, we will stick to Kubernetes and walk through an example demonstrating the benefits of canary deployments.
Summary of key canary deployment concepts
Some of the key concepts discussed in this post are summarized in the table below.
Canary deployment overview
Although this post focuses mainly on explaining and demonstrating the canary deployment strategy for Kubernetes, let’s cover the basics of other strategies for context.
This is the simplest deployment strategy. All the pods are replaced with a new version of the service or application at once. This causes downtime; if the deployment fails, it can take longer to roll back to the previous stable version.
A rolling deployment strategy in Kubernetes is the default strategy. It is intended for releasing new features without downtime. When the pod specifications are changed, Kubernetes starts to replace the currently deployed pods with new image versions.
The maxSurge and maxUnavailable parameters in the YAML files control this behavior. maxSurge indicates the number of pods that are allowed to be created beyond the desired number of instances. The extra capacity is used to create pods with a new image version. Whereas maxUnavailable defines the number of pods that can be decommissioned once the new pods are running. These parameters are used to make sure not all the pods are decommissioned at once.
A rolling deployment strategy is a safer and better (no downtime) way to release new features compared to the basic deployment. However, given its nature, it may be time-consuming as far as the testing efforts are concerned. This is also true for full rollout and rollback measures.
This strategy creates a parallel environment with the same infrastructure set but a new version of the application code. This new environment is called a staging environment. Staging is where tests can be performed before the production release.
This approach allows us to thoroughly test the staging environments without worrying about downtime, as the production serves traffic in complete isolation. Typically, the staging environment is exposed to internal consumers or a selected set of users to ensure its stability before release.
Once the stability is established, the incoming traffic is switched to the staging environment at the load balancer level, and the older environment is decommissioned.
Canary deployment provides more control over releases. With a canary deployment strategy, there is no need for a different environment, thus no additional costs. The user acceptance testing can happen in production with minimal impact. Nowadays, we also use various automated testing strategies to gauge the impact of releasing the changes, especially if we have a defined low-impact target environment for releasing new features. The rollbacks are easier.
The canary deployment concept centers on introducing production changes in small increments of pods/instances serving the traffic. Out of all the currently deployed pods, a small percentage are replaced with new versions of the application image.
The load balancer invariably distributes the traffic amongst all the nodes resulting in a small percentage of traffic being served by the new version of the application image. For example, if 100 pods are running in the production environment and serving 100% of the requests, we can start by replacing 2% of the pods (i.e., 2 out of 100 pods) with a new image.
The load balancer still distributes the traffic to all 100 pods. Consequently, the two new pods get tested. If the requests are being served by these new pods successfully, then more pods are replaced. Let’s say we replace 10% of the pods in the next step.
If something goes wrong, administrators can roll back by simply resetting the number of replicas in the YAML files for older and newer deployments. In the case of our example, the replica count of the YAML file responsible for deploying older versions can be reset to 100 and the new one to 0. Kubernetes will take care of this change automatically.
Such a gradual increase in the number of pods helps increase confidence, and eventually, all 100% of the pods are replaced with new application images, marking the release as successful.
Canary deployment demonstration
Suppose we have a custom Nginx service served by ten pods deployed within our Kubernetes cluster. The current version of our service is v1.0, and we would like to release a new version, v2.0. For this example, the difference between versions is the content on the web pages as shown in the examples below.
Let us assume we have deployed 10 pods of v1.0 on our K8s cluster as shown in the diagram below.
This is achieved by creating the corresponding specifications in our deployment YAML file for v1.0. In the YAML example below, we have used an image with a tag v1.0 and the spec specifies to create 10 replicas of it.
The output of the kubectl command confirms the same.
To release a new version (v2.0) of our custom Nginx image using a canary deployment strategy, we begin by creating a new deployment file. This will create a second deployment object in the Kubernetes cluster. However, the load balancer service would be the same. Following is the YAML file we created for v2.0.
As a first step, we intend to replace 10% of the pods. Thus, we create a single replica of v2.0 and reduce the corresponding replica count of v1.0 deployment to 9. The desired future state is represented below:
Next, we’ll “kubectl apply” both the deployment YAMLs. The output below reflects the corresponding pod deployments - 9 v1.0 and 1 v2.0 pods..
If we continuously refresh our web page, we will see that 10% of the requests respond with v2.0. That confirms we have successfully tested and deployed the first step of our canary deployment.
The curl output below confirms the same.
From here on, we can gradually increase the number of pods for v2.0 and decrease the corresponding number of pods for v1.0. The image below summarizes the steps till we reach a point where all the pods are replaced.
The output below confirms the same. At this moment, any request that reaches our custom Nginx service will always be served by v2.0.
How to rollback changes
At any step, if the tests fail or if the desired results are not achieved, rolling back this deployment is quick and easy. Simply changing the number of replicas in the deployment YAML file of v1.0 back to the original value (10), and deleting the deployment of v2.0 replaces the pods to the previous stable version.
With canary deployments, rollbacks are very safe as it is possible to roll back at every step described above. For example, if things go wrong in the first step, where the impact is controlled or negligible, we can delete the v2.0 deployment and reinstate the number of replicas for v1.0.
Thus we get to test the new release in production with no downtime before we decide to replace all the replicas with new application versions.
How to test and monitor canary deployments
In the example above, we replaced the pods and let the load balancer distribute the traffic based on the routing protocol set. This essentially randomizes the requests, and thus tracking specific requests does not happen in the most efficient way.
To test the canary deployment predictably — the way can do in Blue-Green deployments — we can use various parameters from incoming requests.
These request parameters typically provide information about the origin of the requests. For example, the region where a request originates is based on the geolocation data or IP information. It also helps categorize users, which is a great asset when targeting such requests to the canary pods created.
Organizations can mark and identify incoming requests from users of the UAT group. The load balancer configurations are set in a way that routes these targeted requests to the v2.0 pods alone. If any feedback requires a rollback, then a simple change in configuration files is all that is needed.
Additionally, these request parameters are also helpful in providing data to monitoring systems like Prometheus to track and foresee any negative impact on the end-to-end system.
Canary deployment is the best way of rolling out new features in a SaaS environment. The approach involves introducing the new application image step-by-step in the production environment.
Starting at a small scale minimizes the probability of failure and does not require system downtime. Tweaking the load balancer settings to direct a specific portion of traffic to the new version helps us correctly target and coordinate the testing and monitoring efforts.
Also, since we utilize the existing infrastructure to roll out new features, there is no additional infrastructure cost.
Integrating a canary deployment strategy with monitoring systems to identify, collect, and analyze the same may require an additional learning curve. But once set, this can give crucial insights about the new system without harming the business.