Kubernetes Operators for Automated SRE

May 27, 2020

It can be quite challenging for an SRE team to maintain the well-being of a large-scale Kubernetes-based system with hundreds or thousands of services. In this blog post, Gigi Sayfan, author of “Mastering Kubernetes”, outlines the SRE challenge and how we can achieve the ultimate goal of automated SRE with Kubernetes operators.


    You are part of an SRE team responsible for the well-being of a large-scale Kubernetes-based system with hundreds or thousands of services, possibly integrated with multiple 3rd party providers, maybe managing some hardware too. That’s a lot of responsibility. Everything is moving fast and you have to keep it all together. In this blog post we will talk about the SRE challenge, the ultimate goal of automated SRE, introduce Kubernetes operators and see how they can help us towards our goal. Finally, we will survey some of the current operator frameworks to get you ready to build your own Kubernetes operators.

    The SRE Challenge

    The developers are cranking out code, new features and upgrading their systems. The data keeps growing. The bits keep flowing. Everybody wants more capacity, better performance, cost saving, security in depth, total visibility and absolutely no downtime. After all, SRE stands for site reliability engineering. The site better be reliable!

    If it sounds daunting, the main reason is that it is daunting!

    The SRE discipline and methodology emerged as a means to address this very problem. Let’s explore what happens when you take SRE to the limit.

    The SRE Endgame

    Federico Garcia Lorca once said “Besides black art, there is only automation and mechanization.” If you ever SSH’ed into a broken production server and started fixing it manually you know all about black art.

    Automation has multiple positive effects:

    • Consistency and predictability
    • Once it works it keeps working (unless the environment changes)
    • Can scale easily
    • Frees humans from toil

    Automation is a virtuous cycle. The moment you start automating tasks, you don’t only save time on the task you automated; you also strengthen the automation culture in your organization and open the door to more automation.

    The endgame is to have autonomous systems that can take care of themselves: self-heal, upgrade, patch security vulnerabilities and, in general, just work.

    There are some situations where you need human oversight, but those situations become rarer and rarer as you improve your automation and gain more confidence that it can handle more real-world situations.

    Runbooks and checklists are a staple of professional operators. You can think of each runbook as an opportunity for automation. If the rules are encoded in a runbook, do you actually need a human to perform them?

    So, automation is good. But how do we go about it? Let’s take a page from the book of Kubernetes itself.

    Kubernetes and controllers

    Kubernetes is in its essence a bunch of control loops. It manages various resources like pods, deployments, config maps and secrets. It stores the state of those resources in etcd and then it runs multiple controllers. Each controller is responsible for a specific resource type. Its job is to reconcile the actual state of the resource with its desired state.

    The following diagram shows the Kubernetes architecture:

    The controller manager is a process that contains all these controllers. The controllers watch for different events, as well as for changes to the manifests that represent their resources. When they detect that the actual state is different from the desired state, they take action.

    For example, the ReplicaSetController manages replica sets. If a replica set has a replica count of 3 and the ReplicaSetController detects that there are currently only 2 pods running, it will create another pod to get back to 3.

    apiVersion: apps/v1
    kind: ReplicaSet
    metadata:
      name: awesome-app
      labels:
        app: awesome
    spec:
      # this field controls how many pods should be running
      replicas: 3
      selector:
        matchLabels:
          app: awesome
      template:
        metadata:
          labels:
            app: awesome
        spec:
          containers:
            - name: awesome-app
              image: g1g1/awesome-app:v3.8

    But if a user changes the replica count in the YAML from 3 to 2, the ReplicaSetController will terminate one of the 3 pods.

    Control Loops

    If you look at the big picture of operations it's all about control loops:

    • Define a desired state
    • Watch the target system
    • Compare the desired state to the actual state
    • If the actual state is different from the desired state take action
    • Rinse and repeat

    Note that the desired state is not fixed and may change too.
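
    The steps above can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: the State type and the reconcileOnce and converge functions are hypothetical names, the "system" is just a replica counter, and the loop is bounded rather than running forever on watch events as a real controller would.

```go
package main

import "fmt"

// State is a hypothetical, minimal view of a system: how many replicas exist.
type State struct {
	Replicas int
}

// reconcileOnce compares the desired state with the actual state and returns
// the corrected actual state plus a description of the action taken.
func reconcileOnce(desired, actual State) (State, string) {
	switch {
	case actual.Replicas < desired.Replicas:
		actual.Replicas++
		return actual, "started one replica"
	case actual.Replicas > desired.Replicas:
		actual.Replicas--
		return actual, "stopped one replica"
	default:
		return actual, "nothing to do"
	}
}

// converge repeats reconcileOnce until the states match or maxSteps is
// reached, and reports how many corrective actions it took.
func converge(desired, actual State, maxSteps int) (State, int) {
	steps := 0
	for steps < maxSteps && actual != desired {
		actual, _ = reconcileOnce(desired, actual)
		steps++
	}
	return actual, steps
}

func main() {
	desired := State{Replicas: 3}
	actual := State{Replicas: 1}

	final, steps := converge(desired, actual, 10)
	fmt.Printf("converged to %d replicas in %d steps\n", final.Replicas, steps)

	// The desired state may change too; the same loop handles it.
	desired.Replicas = 2
	final, steps = converge(desired, final, 10)
	fmt.Printf("converged to %d replicas in %d steps\n", final.Replicas, steps)
}
```

    The loop never needs to know why the two states differ. A changed desired state and a drifted actual state are handled by the same comparison.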

    Human operators implement a control loop. They monitor their systems taking actions when the desired state (applications that need to be deployed, performance targets, supported versions of 3rd party software) deviates from the actual state. They also respond when the desired state doesn't change, but the actual system state drifts (nodes going down, manual configuration changes).

    The Operator pattern

    The operator pattern in Kubernetes aims to package the knowledge and skills of a human operator in software. It boils down to Kubernetes custom resources and a custom controller that watches the custom resources and usually some additional system. The custom controller works just like Kubernetes controllers and reconciles the desired state in the spec of the custom resource with the actual state that is reflected in the status.

    The operator pattern was conceived by CoreOS (which was acquired by Red Hat, which was later acquired by IBM) in 2016 and introduced to the world in a blog post.


    The primary motivation was to support stateful applications that often require multiple custom steps for scaling, upgrades, backups, failovers, etc.

    Kubernetes can handle stateless workloads pretty well, but it only offers the StatefulSet for stateful workloads. This is by design. The operational knowledge required to manage stateful workloads is often bespoke and outside the scope of Kubernetes itself.

    The operator pattern is exactly the right abstraction.

    Understanding Kubernetes operators

    Kubernetes operators take the Kubernetes controller pattern that manages native Kubernetes resources (Pods, Deployments, Namespaces, Secrets, etc.) and let you apply it to your own custom resources. Kubernetes extensibility is legendary and operators fit right in.

    If we need to define operators in one formula it would be:

    Operator = Custom Resource + Controller

    Custom Resources

    Custom resources are Kubernetes objects that you define via CRDs (Custom Resource Definitions). Once a CRD is defined, you can create custom resources based on the definition. Kubernetes stores them, and you can interact with them through the Kubernetes API or kubectl, just like existing resources. Here is a CRD for a candy custom resource:

    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      # name must match the spec fields below, and be in the form: <plural>.<group>
      name: candies.awesome.corp.com
    spec:
      # group name to use for REST API: /apis/<group>/<version>
      group: awesome.corp.com
      versions:
        # version name to use for REST API: /apis/<group>/<version>
        - name: v1
          # Each version can be enabled/disabled by served flag.
          served: true
          # One and only one version must be marked as the storage version.
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    flavor:
                      type: string
      # either Namespaced or Cluster
      scope: Namespaced
      names:
        # plural name to be used in the URL: /apis/<group>/<version>/<plural>
        plural: candies
        # singular name to be used as an alias on the CLI and for display
        singular: candy
        # kind is normally the CamelCased singular type. Your resource manifests use this.
        kind: Candy
        # shortNames allow shorter string to match your resource on the CLI
        shortNames:
          - cn

    Don’t be overwhelmed. At the end of the day it defines a simple object that has a name field and a flavor field. Everything else is needed to integrate with Kubernetes and kubectl. For example, the various names in the names section provide a good user experience when presenting information to the user. The schema section allows Kubernetes to validate on your behalf that Candy custom resources adhere to the requirements.

    If CRDs look a little complicated, the custom resources themselves are pretty straightforward. Here is a chocolate candy custom resource:

    apiVersion: awesome.corp.com/v1
    kind: Candy
    metadata:
      name: chocolate
    spec:
      flavor: sweeeeeeet

    With just CRDs and custom resources you can take advantage of Kubernetes and abuse it as a persistent database, a REST API and a command-line client.

    That’s right. Kubernetes will store all your custom resources in etcd for you and provide CRUD access through its API as well as through kubectl.

    For example, we can create the chocolate custom resource via kubectl:

    $ kubectl create -f chocolate.yaml
    candy.awesome.corp.com/chocolate created

    Then, we can list all the candies just like any other resource:

    $ kubectl get candies
    NAME        AGE
    chocolate   2m

    We can get the contents as JSON too. Here, we use the short name cn:

    $ kubectl get cn -o json
    {
      "apiVersion": "v1",
      "items": [
        {
          "apiVersion": "awesome.corp.com/v1",
          "kind": "Candy",
          "metadata": {
            "creationTimestamp": "2020-02-08T10:22:25Z",
            "generation": 1,
            "name": "chocolate",
            "namespace": "default",
            "resourceVersion": "1664",
            "selfLink": "/apis/awesome.corp.com/v1/namespaces/default/candies/chocolate",
            "uid": "1b04f5a9-9ae8-475d-bc7d-245042759304"
          },
          "spec": {
            "flavor": "sweeeeeeet"
          }
        }
      ],
      "kind": "List",
      "metadata": {
        "resourceVersion": "",
        "selfLink": ""
      }
    }

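    If you want to access it programmatically, Kubernetes exposes each custom resource at its own API endpoint. The path is the one shown in the selfLink field of the output above:

    /apis/awesome.corp.com/v1/namespaces/default/candies/chocolate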
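
    As a rough sketch, the path of a namespaced custom resource follows the /apis/<group>/<version>/... convention noted in the CRD comments, plus the namespace and resource name seen in the selfLink field. The customResourcePath helper below is a hypothetical name used only for illustration; it builds the string and does not talk to a cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// customResourcePath builds the API path of a namespaced custom resource.
// An empty name yields the collection path that lists all resources.
func customResourcePath(group, version, namespace, plural, name string) string {
	parts := []string{"apis", group, version, "namespaces", namespace, plural}
	if name != "" {
		parts = append(parts, name)
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	// Path of the single chocolate resource.
	fmt.Println(customResourcePath("awesome.corp.com", "v1", "default", "candies", "chocolate"))
	// Path of the whole candies collection in the default namespace.
	fmt.Println(customResourcePath("awesome.corp.com", "v1", "default", "candies", ""))
}
```

    A path like this can be passed to kubectl get --raw, or requested over HTTP through kubectl proxy, to read the same JSON that kubectl prints.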


    Operator Controllers

    CRDs and custom resources are useful on their own, but when you write your own controllers to manage them you get to reap the real benefit.

    Specifically, operators consist of a controller that has one job - reconcile the actual state with the desired state as specified in the spec of the custom resource.

    Let’s explain how operators work with our chocolate custom resource example.


    Imagine a chocolate factory. The sweetness spec for each chocolate bar is, of course, sweeeeeeet. Our chocolate operator runs in our Kubernetes cluster. It is connected to the chocolate-making machine, where it can control, for example, how much sugar to add. It can also sense the sweetness of each manufactured chocolate bar by sampling small bits. If the actual sweetness doesn't match the spec, that chocolate bar will be disposed of because it didn’t pass quality control. The custom resource can stick around, but its status will record the actual sweetness and whether the chocolate bar was disposed of.

    Other data analytics pipelines can query the custom resources and provide insights (e.g. a specific machine produces too many non-standard chocolate bars and must be calibrated or fixed).

    This way we can bring an external system of a chocolate factory into the fold of Kubernetes and interact with it using Kubernetes concepts and tooling.
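
    To make the chocolate example a little more concrete, here is a minimal Go sketch of the decision the operator makes for one bar. Everything in it is an assumption made for illustration: the Candy, CandySpec and CandyStatus types, the status field names and the reconcileCandy function are hypothetical, and the measured flavor is passed in as a plain string instead of coming from a real machine.

```go
package main

import "fmt"

// CandySpec is the desired state; CandyStatus is the observed state.
// Both are hypothetical stand-ins for the Candy custom resource.
type CandySpec struct {
	Flavor string
}

type CandyStatus struct {
	ActualFlavor string
	Disposed     bool
}

type Candy struct {
	Name   string
	Spec   CandySpec
	Status CandyStatus
}

// reconcileCandy records the measured flavor in the status and marks the
// bar as disposed when the measurement does not match the spec.
func reconcileCandy(c Candy, measuredFlavor string) Candy {
	c.Status.ActualFlavor = measuredFlavor
	c.Status.Disposed = measuredFlavor != c.Spec.Flavor
	return c
}

func main() {
	bar := Candy{Name: "chocolate", Spec: CandySpec{Flavor: "sweeeeeeet"}}

	good := reconcileCandy(bar, "sweeeeeeet")
	bad := reconcileCandy(bar, "bitter")

	fmt.Printf("%s: actual=%s disposed=%t\n", good.Name, good.Status.ActualFlavor, good.Status.Disposed)
	fmt.Printf("%s: actual=%s disposed=%t\n", bad.Name, bad.Status.ActualFlavor, bad.Status.Disposed)
}
```

    The spec is never modified here. Only the status changes, which keeps the desired state and the observed state cleanly separated.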

    The etcd Operator

    Let’s look at a real operator - the etcd operator. As you know, Kubernetes manages its state in an internal etcd cluster. But etcd is a general-purpose key-value store, and you may want to install etcd in your Kubernetes cluster for use by your workloads. It is possible to use the same etcd cluster used by Kubernetes, but it is not a good idea because it’s considered an implementation detail of Kubernetes and it’s also configured to listen only on localhost.

    With the etcd operator you can easily install and manage your own etcd cluster and reap all the benefits. You can find the etcd operator on OperatorHub.io, which is a community site that curates Kubernetes operators.

    Here are some of the features you get out of the box:

    • High availability - Multiple instances of etcd are networked together and secured. Individual failures or networking issues are transparently handled to keep your cluster up and running.
    • Automated updates - Rolling out a new etcd version works like all Kubernetes rolling updates. Simply declare the desired version, and the etcd service starts a safe rolling update to the new version automatically.
    • Backups included - Create etcd backups and restore them through the etcd Operator.

    The etcd operator manages 3 different CRDs: Cluster, Backup and Restore.

    Here is what an etcd cluster custom resource looks like:

    apiVersion: etcd.database.coreos.com/v1beta2
    kind: EtcdCluster
    metadata:
      name: example
    spec:
      size: 3
      version: 3.2.13

    The spec has a size field and a version field. For example, by modifying the version field you can signal to the operator that you want to upgrade your etcd cluster. Safely upgrading a distributed data store is a non-trivial procedure, but the operator encapsulates all the knowledge and lets users just update one field in a YAML file, sit back and watch the magic happen.

    Let’s look at some code, just to get a sense of what operator code is like. The etcd operator is implemented in Go and has multiple packages. Here is the heart of the operator - the reconcile() method of the Cluster type:

    func (c *Cluster) reconcile(pods []*v1.Pod) error {
        c.logger.Infoln("Start reconciling")
        defer c.logger.Infoln("Finish reconciling")

        defer func() {
            c.status.Size = c.members.Size()
        }()

        sp := c.cluster.Spec
        running := podsToMemberSet(pods, c.isSecureClient())
        if !running.IsEqual(c.members) || c.members.Size() != sp.Size {
            return c.reconcileMembers(running)
        }

        if needUpgrade(pods, sp) {
            m := pickOneOldMember(pods, sp.Version)
            return c.upgradeOneMember(m.Name)
        }

        return nil
    }

    We’re not going to analyze each line, but the gist of it is that the operator checks the size field of the spec and compares it to the actual number of members in the cluster. If the numbers don’t match then the operator calls the reconcileMembers() method that resizes the cluster properly.

    Then it checks if an upgrade is required and if this is the case, the operator performs a rolling upgrade by upgrading one old member at a time until all members are at the new version.

    The operator also makes sure to always update the status to the actual state.
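
    The ordering of those checks can be illustrated with a small standalone Go sketch. It is not the etcd operator's code: the ClusterSpec and Member types, the member names and the nextAction function are hypothetical simplifications, and the "actions" are just strings. It only shows that a size mismatch is handled before any upgrade, and that upgrades proceed one outdated member at a time.

```go
package main

import "fmt"

// ClusterSpec mirrors the two spec fields discussed above.
type ClusterSpec struct {
	Size    int
	Version string
}

// Member is a hypothetical, simplified view of one running etcd member.
type Member struct {
	Name    string
	Version string
}

// nextAction decides the single next step, in the same order as reconcile():
// fix the member count first, then upgrade one outdated member at a time.
func nextAction(spec ClusterSpec, members []Member) string {
	if len(members) != spec.Size {
		return fmt.Sprintf("resize from %d to %d members", len(members), spec.Size)
	}
	for _, m := range members {
		if m.Version != spec.Version {
			return fmt.Sprintf("upgrade %s to %s", m.Name, spec.Version)
		}
	}
	return "none"
}

func main() {
	spec := ClusterSpec{Size: 3, Version: "3.2.13"}
	members := []Member{
		{Name: "example-0", Version: "3.2.13"},
		{Name: "example-1", Version: "3.1.10"},
	}
	// Only two members are running, so resizing takes priority.
	fmt.Println(nextAction(spec, members))

	// With three members, the first outdated one is picked for upgrade.
	members = append(members, Member{Name: "example-2", Version: "3.1.10"})
	fmt.Println(nextAction(spec, members))
}
```

    Returning after a single action mirrors the early returns in reconcile(): each pass does one thing, and the next pass re-evaluates the cluster from scratch.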

    Survey of Operator Frameworks

    Using operators is typically very simple because all the complexity is encapsulated by the operator. But someone has to write the operator and deal with the complexities of stateful, async, distributed systems as well as integrate with the Kubernetes API machinery. This is not trivial. Luckily, the Kubernetes community developed several frameworks to assist in writing Kubernetes controllers in general and operators in particular. Most of these frameworks are Go frameworks, as Go is the implementation language of Kubernetes itself and the most high-fidelity client libraries are also implemented in Go. But there is also one Python framework for you Pythonistas, and one language-agnostic framework.


    Kubebuilder

    Kubebuilder is a Go framework for building Kubernetes API extensions based on CRDs, controllers and admission webhooks (to validate custom resources). It is developed by the Kubernetes API Machinery special interest group (SIG). It can be considered the “official” way to build API extensions. In addition, it has a lot of momentum and it provides a lot of capabilities out of the box. It promotes the following workflow:

    1. Create a new project directory
    2. Create one or more resource APIs as CRDs and then add fields to the resources
    3. Implement reconcile loops in controllers and watch additional resources
    4. Test by running against a cluster (self-installs CRDs and starts controllers automatically)
    5. Update bootstrapped integration tests to test new fields and business logic
    6. Build and publish a container from the provided Dockerfile

    Under the covers Kubebuilder is using the controller-runtime library for a lot of the heavy lifting.

    There is an entire book about Kubebuilder that you can peruse: https://book.kubebuilder.io/

    Operator Framework

    The Operator Framework is another mature framework. It was originally developed by CoreOS, the originators of the operator concept. It is still going strong and has excellent documentation as well as a lot of components. One of the core components is the Operator SDK. However, there is an integration project going on to merge Kubebuilder and the Operator SDK.

    The Operator SDK is also built on top of the controller-runtime library. If and when Kubebuilder assimilates the Operator SDK, it is not clear what the future of the Operator Framework as a whole will be.


    Metacontroller

    The Metacontroller framework is different. It is built on the concept of webhooks. Those webhooks are served by a lambda controller that runs in Kubernetes and invokes your lambda functions, which can be implemented in any language. You get a lot of flexibility at the cost of an additional layer of indirection.

    Kopf - Kubernetes Operators Framework

    Kopf is a Python operator framework that makes development very Pythonic. Kopf provides both the “outer” toolkit to interact with Kubernetes, deploy your operators and run them in the cluster as well as “inner” libraries to manipulate Kubernetes resources and in particular custom resources.


    Conclusion

    Operators are an extremely powerful pattern for managing stateful applications in Kubernetes. The conceptual model follows control theory. The utility of the operator pattern became clear as soon as CoreOS introduced it to the world and a plethora of operators are now available. You may consider building operators for your system and if you do, there are a variety of frameworks and tools to assist you along the way.

    Plug: Keep your K8s clusters reliable with Squadcast

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.
