Canary Deployment: Tutorial & Examples

Releasing new features in a live SaaS environment involves many moving parts and raises concerns such as downtime, cost, and ease of rollback.

The canary deployment strategy employs the gradual rollout of new versions of the software. We can detect unforeseen errors by starting small and testing a small percentage of traffic on the latest version while the older stable version serves most of the traffic. This avoids widespread issues created by the new version. The older nodes are gradually replaced with new versions when confidence in the release is sufficient.

The term “canary” in this context comes from the mining industry. Coal miners used to send canaries into mines to detect toxic gases. The miners determined whether the mine was safe based on the canary's reaction. If the bird died or was negatively affected, the mine was unsafe. If the bird remained healthy, the miners were good to proceed with work. This is precisely the logic we use in canary deployments.

In general, the strategies discussed here may apply to multiple deployment scenarios. In this post, we will stick to Kubernetes and walk through an example demonstrating the benefits of canary deployments.

Summary of key canary deployment concepts

Some of the key concepts discussed in this post are summarized in the table below.

Concept	Description
Traffic/Ingress	All the incoming requests trying to access the SaaS product.
Services	Cloud-native SaaS products are typically based on a microservice architecture. A service is a component that serves a specific purpose. During releases, all or some of the services may need to be replaced with new versions.
Load balancing	Load balancers distribute incoming traffic across multiple instances of a service to reduce latency and improve performance. They can also be configured to route traffic to specific nodes based on request parameters.
Basic deployment	This deployment strategy is straightforward. We remove all the older versions of the services, and create new versions afterwards. This involves downtime.
Rolling deployment	Existing sets of instances/pods are gradually replaced with new versions. Specifying the maximum number of pods going down for replacement and the minimum number of healthy pods in existence is possible.
Blue-Green deployment	This deployment strategy creates a parallel environment with the new version, and traffic is automatically switched to the new deployment by adjusting load balancer properties. Once successful, the older environment is decommissioned.
Canary deployment	Controlled, gradual deployment of new pods/instances, which facilitates easy rollback and avoids downtime.
Rollback	If deployment fails, the operations must be rolled back to the previous stable version.
Downtime	Length of time an application is unavailable to serve requests.

Canary deployment overview

Although this post focuses mainly on explaining and demonstrating the canary deployment strategy for Kubernetes, let’s cover the basics of other strategies for context.

Basic/recreate deployment

This is the simplest deployment strategy. All the pods are replaced with a new version of the service or application at once. This causes downtime, and if the deployment fails, rolling back to the previous stable version takes additional time.
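
In Kubernetes, this behavior corresponds to the Recreate strategy type of a Deployment. The fragment below is a minimal sketch of the relevant field only; it is meant to be merged into an existing Deployment spec, not applied on its own.

```yaml
# Fragment of a Deployment manifest (not a complete object).
spec:
  strategy:
    type: Recreate   # terminate all existing pods before creating new ones
```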

Pros	Cons
Simple to implement	Cannot avoid downtime/outage.
	Difficult to roll back.

Rolling deployment

Rolling deployment is the default strategy in Kubernetes. It is intended for releasing new features without downtime. When the pod specification changes, Kubernetes starts replacing the currently deployed pods with pods running the new image version.

The maxSurge and maxUnavailable parameters in the deployment YAML control this behavior. maxSurge is the maximum number of pods that can be created beyond the desired number of replicas; this extra capacity is used to create pods with the new image version. maxUnavailable is the maximum number of pods that can be unavailable during the update, which limits how many old pods are taken down at a time. Both can be given as an absolute number or as a percentage of the desired replicas. Together, these parameters ensure that not all the pods are decommissioned at once.
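
As an illustration, the fragment below sketches how these two parameters might be set on a Deployment with ten replicas. The values are examples chosen for this sketch, not recommendations.

```yaml
# Fragment of a Deployment manifest (not a complete object).
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # at most 12 pods may exist during the update
      maxUnavailable: 1   # at least 9 pods must stay available
```

With these example values, Kubernetes can add up to two new pods before removing old ones, and it never lets the number of available pods drop below nine.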

A rolling deployment is safer than a basic deployment because it avoids downtime. However, because pods are replaced gradually, testing can be time-consuming, and the same is true of a full rollout or rollback.

Pros	Cons
Default K8s behavior, easy to implement	Time-consuming
No downtime	Complex to roll back
No tweaking of load balancer

Blue-green deployment

This strategy creates a parallel environment with the same infrastructure set but a new version of the application code. This new environment is called a staging environment. Staging is where tests can be performed before the production release.

This approach allows us to thoroughly test the staging environment without worrying about downtime, as the production environment serves traffic in complete isolation. Typically, the staging environment is exposed to internal consumers or a selected set of users to ensure its stability before release.

Once the stability is established, the incoming traffic is switched to the staging environment at the load balancer level, and the older environment is decommissioned.
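
On Kubernetes, one possible way to perform this switch is to change the selector of the Service that fronts both environments. The sketch below assumes hypothetical names and labels (myapp-service, app: myapp, version: blue/green) that do not come from this post.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service   # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app: myapp          # hypothetical label shared by both environments
    version: green      # was "blue"; changing this value switches all traffic
  ports:
    - port: 80
      targetPort: 80    # assumed container port
```

Because the switch is a single field, reverting it to the old value sends traffic back to the previous environment, provided that environment has not been decommissioned yet.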

Pros	Cons
Safe way to release new features to SaaS environment	Can be costly, as full-scale parallel infrastructure exists for a while
No downtime	Needs configuration changes at load balancer

Canary deployment

Canary deployment provides more control over releases. With a canary deployment strategy, there is no need for a different environment, and thus no additional cost. User acceptance testing can happen in production with minimal impact. Nowadays, we also use various automated testing strategies to gauge the impact of releasing the changes, especially if we have a defined low-impact target environment for releasing new features. Rollbacks are also easier.

The canary deployment concept centers on introducing production changes in small increments of pods/instances serving the traffic. Out of all the currently deployed pods, a small percentage are replaced with new versions of the application image.

The load balancer continues to distribute traffic across all the pods, so a small percentage of traffic is served by the new version of the application image. For example, if 100 pods are running in the production environment and serving 100% of the requests, we can start by replacing 2% of the pods (i.e., 2 out of 100 pods) with a new image.

The load balancer still distributes the traffic to all 100 pods. Consequently, the two new pods get tested. If the requests are being served by these new pods successfully, then more pods are replaced. Let’s say we replace 10% of the pods in the next step.

If something goes wrong, administrators can roll back by simply resetting the number of replicas in the YAML files for older and newer deployments. In the case of our example, the replica count of the YAML file responsible for deploying older versions can be reset to 100 and the new one to 0. Kubernetes will take care of this change automatically.

Such a gradual increase in the number of pods helps build confidence, and eventually, all of the pods are replaced with the new application image, marking the release as successful.

Canary deployment demonstration

Suppose we have a custom Nginx service served by ten pods deployed within our Kubernetes cluster. The current version of our service is v1.0, and we would like to release a new version, v2.0. For this example, the difference between versions is the content on the web pages as shown in the examples below.

Let us assume we have deployed 10 pods of v1.0 on our K8s cluster as shown in the diagram below.

This is achieved by creating the corresponding specifications in our deployment YAML file for v1.0. In the YAML example below, we use an image tagged v1.0, and the spec requests 10 replicas of it.

---


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mynginx-v1-deployment
      labels:
        app: mynginx
    spec:
      replicas: 10               # ten pods serve v1.0
      selector:
        matchLabels:
          app: mynginx
      template:
        metadata:
          labels:
            app: mynginx         # must match spec.selector.matchLabels
        spec:
          containers:
            - name: mynginx-v1
              image: sumeetninawe/mynginx:v1.0
              resources:
                requests:
                  cpu: "10m"
                  memory: "150Mi"
                limits:
                  cpu: "50m"
                  memory: "400Mi"
              imagePullPolicy: Always
          restartPolicy: Always

---

The output of kubectl get all confirms this.


    canaryDeployment % kubectl get all
    NAME                                         READY   STATUS    RESTARTS   AGE
    pod/mynginx-v1-deployment-55fcf9bd4c-7vqn5   1/1     Running   0          101s
    pod/mynginx-v1-deployment-55fcf9bd4c-8lcgt   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-b48gh   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-bsl99   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-dwg9l   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-m8qfl   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-qc6l2   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-s5nvt   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-sgm78   1/1     Running   0          100s
    pod/mynginx-v1-deployment-55fcf9bd4c-zgdzx   1/1     Running   0          100s
    
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
    service/kubernetes        ClusterIP      10.16.0.1     <none>         443/TCP        17m
    service/mynginx-service   LoadBalancer   10.16.7.155   104.199.38.8   80:30780/TCP   43s
    
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/mynginx-v1-deployment   10/10   10           10          102s
    
    NAME                                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/mynginx-v1-deployment-55fcf9bd4c   10        10        10      102s

To release a new version (v2.0) of our custom Nginx image using a canary deployment strategy, we begin by creating a new deployment file. This will create a second deployment object in the Kubernetes cluster. The load balancer service, however, remains the same. The YAML file we created for v2.0 is shown below.

---


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mynginx-v2-deployment
      labels:
        app: mynginx
    spec:
      replicas: 1                # start the canary with a single pod
      selector:
        matchLabels:
          app: mynginx
      template:
        metadata:
          labels:
            app: mynginx         # same label as the v1.0 pods
        spec:
          containers:
            - name: mynginx-v2
              image: sumeetninawe/mynginx:v2.0
              resources:
                requests:
                  cpu: "10m"
                  memory: "150Mi"
                limits:
                  cpu: "50m"
                  memory: "400Mi"
              imagePullPolicy: Always
          restartPolicy: Always

---
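
Both deployments label their pods app: mynginx, which is what lets a single Service send traffic to either version. The manifest for mynginx-service is not shown in this post, so the following is only a sketch of what it might look like. The name, type, and port 80 come from the kubectl output earlier; the selector and targetPort are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mynginx-service
spec:
  type: LoadBalancer
  selector:
    app: mynginx       # assumed: matches pods from both the v1.0 and v2.0 deployments
  ports:
    - protocol: TCP
      port: 80         # external port shown in the kubectl output
      targetPort: 80   # assumed container port (the Nginx default)
```

Because the selector does not mention a version, the share of traffic each version receives follows from how many pods of each version are running.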

As a first step, we intend to replace 10% of the pods. Thus, we create a single replica of v2.0 and reduce the replica count of the v1.0 deployment to 9. The desired future state is represented below:

Next, we run kubectl apply on both deployment YAML files. The output below shows the resulting pods: nine running v1.0 and one running v2.0.


    canaryDeployment % kubectl get all
    NAME                                         READY   STATUS    RESTARTS   AGE
    pod/mynginx-v1-deployment-55fcf9bd4c-7vqn5   1/1     Running   0          8m20s
    pod/mynginx-v1-deployment-55fcf9bd4c-8lcgt   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-bsl99   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-dwg9l   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-m8qfl   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-qc6l2   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-s5nvt   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-sgm78   1/1     Running   0          8m19s
    pod/mynginx-v1-deployment-55fcf9bd4c-zgdzx   1/1     Running   0          8m19s
    pod/mynginx-v2-deployment-5d5f948fb7-m6xgk   1/1     Running   0          18s
    
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
    service/kubernetes        ClusterIP      10.16.0.1     <none>         443/TCP        23m
    service/mynginx-service   LoadBalancer   10.16.7.155   104.199.38.8   80:30780/TCP   7m22s
    
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/mynginx-v1-deployment   9/9     9            9           8m21s
    deployment.apps/mynginx-v2-deployment   1/1     1            1           19s
    
    NAME                                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/mynginx-v1-deployment-55fcf9bd4c   9         9         9       8m21s
    replicaset.apps/mynginx-v2-deployment-5d5f948fb7   1         1         1       19s

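If we continuously refresh our web page, we will see that roughly 10% of the responses come from v2.0. The exact share varies from one sample to the next because the load balancer does not alternate between pods in a fixed order. This confirms that the first step of our canary deployment is deployed and serving traffic.

The curl output below confirms this.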


    canaryDeployment % for ((i=1;i<=100;i++)); do   curl "104.199.38.8"; sleep .5; echo; done
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V2.0<H1>
    <H1>Hello World - V2.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>
    <H1>Hello World - V1.0<H1>

From here on, we can gradually increase the number of pods for v2.0 and decrease the corresponding number of pods for v1.0. The image below summarizes the steps until we reach the point where all the pods are replaced.

Gradual increment in the number of v2.0 pods, until full replacement.

The output below confirms this. At this point, every request that reaches our custom Nginx service is served by v2.0.


    canaryDeployment % kubectl get all
    NAME                                         READY   STATUS    RESTARTS   AGE
    pod/mynginx-v2-deployment-5d5f948fb7-2m724   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-85grg   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-blhx9   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-gps9l   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-gzzgh   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-lzpm6   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-m6xgk   1/1     Running   0          13m
    pod/mynginx-v2-deployment-5d5f948fb7-nhrxf   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-rfdfw   1/1     Running   0          24s
    pod/mynginx-v2-deployment-5d5f948fb7-zfn4k   1/1     Running   0          24s
    
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
    service/kubernetes        ClusterIP      10.16.0.1     <none>         443/TCP        36m
    service/mynginx-service   LoadBalancer   10.16.7.155   104.199.38.8   80:30780/TCP   20m
    
    NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/mynginx-v1-deployment   0/0     0            0           21m
    deployment.apps/mynginx-v2-deployment   10/10   10           10          13m
    
    NAME                                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/mynginx-v1-deployment-55fcf9bd4c   0         0         0       21m
    replicaset.apps/mynginx-v2-deployment-5d5f948fb7   10        10        10      13m

How to roll back changes

At any step, if the tests fail or the desired results are not achieved, rolling back this deployment is quick and easy. Changing the number of replicas in the v1.0 deployment YAML file back to the original value (10) and deleting the v2.0 deployment returns all pods to the previous stable version.

With canary deployments, rollbacks are very safe as it is possible to roll back at every step described above. For example, if things go wrong in the first step, where the impact is controlled or negligible, we can delete the v2.0 deployment and reinstate the number of replicas for v1.0.
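
As a sketch, a rollback from the first step only touches the replicas field of each manifest. The file names in the comments are hypothetical, and the fragments show only the changed field; the rest of each deployment stays as defined earlier. Setting the v2.0 replica count to 0 is an alternative to deleting the deployment that keeps the object around for a later retry.

```yaml
# mynginx-v1-deployment.yaml (hypothetical file name, fragment only):
# restore the stable version to its original size
spec:
  replicas: 10
---
# mynginx-v2-deployment.yaml (hypothetical file name, fragment only):
# scale the canary down to zero pods
spec:
  replicas: 0
```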

Thus we get to test the new release in production with no downtime before we decide to replace all the replicas with new application versions.

How to test and monitor canary deployments

In the example above, we replaced the pods and let the load balancer distribute the traffic according to its configured routing behavior. This effectively randomizes which version serves a given request, which makes it hard to track specific requests.

To test a canary deployment predictably, the way we can in blue-green deployments, we can route on various parameters of the incoming requests.

These request parameters typically provide information about the origin of the requests. For example, the region where a request originates can be inferred from geolocation data or IP information. They also help categorize users, which is useful when directing specific requests to the canary pods.

Organizations can mark and identify incoming requests from users of the UAT group. The load balancer configurations are set in a way that routes these targeted requests to the v2.0 pods alone. If any feedback requires a rollback, then a simple change in configuration files is all that is needed.
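
One possible way to express such a rule is with the canary annotations of the ingress-nginx controller. The sketch below rests on several assumptions that are not part of this post: ingress-nginx is installed, a regular Ingress for the same host already routes to the stable service, and a separate mynginx-v2-service selects only the v2.0 pods (which would require an extra distinguishing label on those pods). The host name and the X-User-Group header are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mynginx-canary                 # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # Requests carrying "X-User-Group: uat" are routed to the canary backend.
    nginx.ingress.kubernetes.io/canary-by-header: "X-User-Group"
    nginx.ingress.kubernetes.io/canary-by-header-value: "uat"
spec:
  ingressClassName: nginx
  rules:
    - host: mynginx.example.com        # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mynginx-v2-service   # hypothetical Service for v2.0 pods only
                port:
                  number: 80
```

With a rule like this, requests without the header keep going to the stable version, and removing the canary Ingress is enough to stop routing UAT users to v2.0.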

These request parameters also provide useful data to monitoring systems like Prometheus, which can track and anticipate negative impact on the end-to-end system.
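
As an illustration, a Prometheus alerting rule could watch the error rate of the canary pods. The metric name http_requests_total and its version and status labels are hypothetical; they assume the application or its proxy exports such a metric, which the Nginx image in this post is not shown to do. The 5% threshold is an arbitrary example.

```yaml
groups:
  - name: canary-alerts
    rules:
      - alert: CanaryHighErrorRate
        # Share of 5xx responses among all v2.0 responses over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{version="v2.0", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{version="v2.0"}[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Canary v2.0 error rate has been above 5% for 5 minutes"
```

An alert like this can serve as the signal to roll back before more pods are moved to the new version.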

Conclusion

Canary deployment is the best way of rolling out new features in a SaaS environment. The approach involves introducing the new application image step-by-step in the production environment.

Starting at a small scale limits the impact of any failure and does not require system downtime. Tweaking the load balancer settings to direct a specific portion of traffic to the new version helps us correctly target and coordinate the testing and monitoring efforts.

Also, since we utilize the existing infrastructure to roll out new features, there is no additional infrastructure cost.

Integrating a canary deployment strategy with monitoring systems to identify, collect, and analyze the relevant data may involve a learning curve. But once set up, this can give crucial insights about the new system without harming the business.