
The Age of Service Mesh

Nov 28, 2019
Last Updated: May 2, 2024

There has been some hype around service meshes for a while now. But what are they and why are they needed? In this article, Gigi Sayfan, a Principal Software Architect and author of “Mastering Kubernetes”, explores the what and why of service meshes and how they work with Kubernetes.

    Overview

    You have built a massively successful system. The users just can't get enough and request new features. Your developers crank out new services on a regular basis. Your DevOps/SRE team configures and scales your Kubernetes cluster (or clusters). As the system becomes more complicated and sophisticated, you realize that there are common themes that repeat across all your services:

    - Advanced load balancing

    - Service discovery

    - Support for canary deployments

    - Caching

    - Tracing a request across multiple microservices

    - Authentication between services

    - Limiting the number of requests a service handles at a given time

    - Automatically retrying failed requests

    - Failing over to an alternative component when a component fails consistently

    - Collecting metrics on traffic

    You quickly realize that all these concerns are shared by all your services. Kubernetes helps with some of these concerns like service discovery and load balancing, but you often need more powerful support. You definitely don't want to implement them in each service separately. The traditional way of addressing these issues is to write a big library that all services use. This is a reasonable approach, but there is a better way - the service mesh.

    In this article we will explain what a service mesh is, why it is such an important trend and how it works with Kubernetes. Then we'll review some of the many existing service meshes and discuss their relative pros and cons.

    Let's get going...

    What's a service mesh?

    A service mesh is an architectural pattern for large-scale cloud-native applications that are composed of many microservices. A lot of stuff happens between services. The service mesh externalizes all these concerns outside your application services and manages them centrally using proxies that intercept all traffic between services. Then you configure the service mesh to perform all the cool stuff on your behalf, such as traffic shaping, security and observability.

    [Diagram: what a service mesh looks like]

    Note that service meshes are not unique to Kubernetes. Here we focus on Kubernetes, but many of the concepts translate to other systems with a large number of interacting components.

    Proxies

    What's a proxy? A proxy is a component that sits in front of a service. When other services talk to your service, they go through the proxy, which can do various things: pass the request through, send it somewhere else, reject it or even modify it. This is similar in spirit to Kubernetes admission controllers.
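
    To make the four proxy behaviors concrete, here is a minimal sketch in Python. It is not how any real service mesh proxy is implemented; the `Request` and `Response` types, the block list and the `/legacy` routing rule are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class Request:
    source: str                      # name of the calling service
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Response:
    status: int
    body: str


Handler = Callable[[Request], Response]


def make_proxy(service: Handler, fallback: Handler, blocked: Set[str]) -> Handler:
    """Wrap a service handler with a few hypothetical proxy rules."""

    def proxy(request: Request) -> Response:
        # Reject: callers on the block list never reach the service.
        if request.source in blocked:
            return Response(403, "rejected by proxy")
        # Send somewhere else: legacy paths go to a different handler.
        if request.path.startswith("/legacy"):
            return fallback(request)
        # Modify, then pass through: tag the request with an extra header.
        tagged = replace(request, headers={**request.headers, "x-via": "proxy"})
        return service(tagged)

    return proxy
```

    The service handler itself stays unaware of these rules, which is the point: the behavior lives in the wrapper, not in the application code.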

    There are two primary ways to deploy a service mesh into your Kubernetes cluster.

    Sidecar containers

    The sidecar container approach injects a proxy container into every pod.

    [Diagram: a service mesh that uses sidecar containers]

    Some of the attributes of sidecar containers are:

    - No need to deploy an agent on each node

    - Ability to deploy different pods with different sidecars (or versions) on the same node

    - Each pod has its own copy of the proxy

    Are those attributes pros or cons? That depends on context. For example, as an administrator you may prefer to be oblivious to the service mesh, or alternatively you may want to control exactly what's going on at the network management level.

    Node agents

    The node agent approach installs a single agent on each node that intercepts the traffic and performs the routing and other service mesh functions.

    [Diagram: a service mesh that uses node agents]

    Some of the attributes of the node agent proxies are:

    - More universal (doesn't require Kubernetes)

    - More control over the service mesh proxies

    - More efficient (no need for deploying a proxy per pod)

    - Requires separate installation and maintenance

    Data plane vs. control plane

    When thinking about a service mesh there are two separate aspects: the data plane and the control plane. The data plane is the set of proxies that connect your services (either as sidecar containers or node agents). The control plane, as the name suggests, controls the proxies that comprise the data plane. It is often a set of APIs and tools to configure policies, collect metrics and get an aggregated view of your service mesh.
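
    The split can be sketched in a few lines of Python. This is a toy model under simple assumptions, not the API of any real mesh: `SidecarProxy`, `ControlPlane` and the policy keys are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SidecarProxy:
    """Data plane: one proxy per service instance, holding its current policy."""
    service: str
    policy: Dict[str, int] = field(default_factory=dict)

    def apply(self, policy: Dict[str, int]) -> None:
        # Copy so that later changes in the control plane are not shared by accident.
        self.policy = dict(policy)


@dataclass
class ControlPlane:
    """Control plane: one place to set policy and view the whole mesh."""
    proxies: List[SidecarProxy] = field(default_factory=list)

    def register(self, proxy: SidecarProxy) -> None:
        self.proxies.append(proxy)

    def set_policy(self, policy: Dict[str, int]) -> None:
        # Push the same policy to every proxy in the data plane.
        for proxy in self.proxies:
            proxy.apply(policy)

    def overview(self) -> Dict[str, Dict[str, int]]:
        # Aggregated view: the policy currently applied per service.
        return {proxy.service: proxy.policy for proxy in self.proxies}
```

    Operators talk only to the control plane; the proxies receive configuration and handle the actual traffic.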

    Service mesh on Kubernetes

    Let's review some of the benefits that a service mesh can bring to your Kubernetes cluster!

    Advanced load balancing

    Kubernetes services provide a basic form of load balancing where a pool of backing pods serves requests coming into the service. You can even implement simple canary deployments using services and labels. If you want 10% of your requests to go to version 2 of a service, you can deploy nine pods with version 1 and one pod with version 2. But with a service mesh you can do much more advanced load balancing that operates at the request level and not at the pod level. You can also do load balancing based on request path and parameters, or use different algorithms like least number of connections for super fine-grained control.
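
    As a rough illustration of request-level balancing, the sketch below picks a version per request from percentage weights and, separately, picks a backend by least connections. The function names, version labels and pod names are assumptions made for this example.

```python
import random
from typing import Dict


def pick_version(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose a version for one request according to its relative weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def least_connections(active: Dict[str, int]) -> str:
    """Choose the backend that currently has the fewest open connections."""
    return min(active, key=active.get)


# Send roughly 10% of requests to v2, regardless of how many pods run each version.
rng = random.Random(42)
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[pick_version({"v1": 90, "v2": 10}, rng)] += 1

backend = least_connections({"pod-a": 3, "pod-b": 1, "pod-c": 2})
```

    Because the split is decided per request, changing the canary share from 10% to 1% is a configuration change rather than a change in pod counts.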

    Authentication and authorization

    Authentication between services is important for defense in depth. Kubernetes provides strong authentication and authorization around access to cluster resources, as well as network policies, but a service mesh can take it to the next level with automatic mutual TLS and custom authorization.
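
    Here is a minimal sketch of what a custom authorization rule might look like once the caller's identity is known. It assumes the identity was already verified from the peer's mutual TLS certificate, which is not shown; the service names and the `ALLOWED_CALLERS` table are hypothetical.

```python
from typing import Dict, Set

# Hypothetical policy: destination service -> identities allowed to call it.
ALLOWED_CALLERS: Dict[str, Set[str]] = {
    "payments": {"checkout"},
    "inventory": {"checkout", "catalog"},
}


def is_allowed(peer_identity: str, destination: str) -> bool:
    """Deny by default; allow only identities listed for the destination."""
    return peer_identity in ALLOWED_CALLERS.get(destination, set())
```

    With this table, `is_allowed("catalog", "payments")` is false, and any destination without an entry rejects every caller.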

    Circuit breaking

    Circuit breaking takes an unresponsive instance out of circulation. That helps prevent the long delays caused by constantly retrying to reach an overloaded or dead pod. Kubernetes has some decent support for unresponsive pods with health checks and readiness probes. But if the problem is a misconfiguration or a problem within the service itself, Kubernetes can't help much. A service mesh operates at a higher level of abstraction and can do circuit breaking based on the results of requests.
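
    The idea can be sketched as a small state machine. This is a simplified, single-threaded illustration, not the algorithm of any particular mesh; the thresholds and the `CircuitBreaker` class are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the backend."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after        # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a backend that keeps failing.
                raise CircuitOpenError("backend temporarily out of circulation")
            # Cool-down elapsed: allow a trial request (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

    Note that the breaker reacts to failed requests, not to container health, so it also covers a pod that is running but misbehaving.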

    Rate limiting

    Rate limiting is important to protect against denial-of-service attacks, where attackers bombard your system with lots of requests, hoping to bring it to its knees via resource exhaustion. It also helps to avoid paying enormous bills if you misconfigure your system or a load test goes out of control. Another use case is to prevent cascading failures, where excessive load on one service propagates to lots of internal services.

    A service mesh lets you define and control those limits centrally and without impacting the services themselves.

    Kubernetes doesn't provide any built-in help here.
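
    One common way to implement such a limit is a token bucket. The sketch below is a generic illustration and makes no claim about how a specific mesh enforces its limits; `TokenBucket` and its parameters are names chosen for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time that passed, without exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the proxy would reject this request
```

    A proxy would typically keep one bucket per service or per client and answer with an error such as HTTP 429 when `allow()` returns false.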

    Retries and failovers

    Building distributed systems is all about building a reliable system out of unreliable components. In a large microservice-based distributed system some services may be unreachable temporarily due to networking issues, maintenance or upgrades.

    A service mesh can be configured to automatically retry failed requests. Retries address temporary intermittent failures in a smooth and streamlined way. However, if a service is down or unreachable for a prolonged period of time, it is often best to fail over to an alternative location (e.g. in another region).

    Kubernetes automatically restarts failed containers, and replica sets and deployments ensure that enough pods are always running. But as long as the pods and containers are running, Kubernetes will not help retry requests or fail over in case of consistent failures.
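
    The combination of retries and failover can be sketched as two nested loops. This is a simplified illustration: the function name is hypothetical, endpoints are modeled as plain callables, and a real setup would usually add a backoff delay between attempts.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(endpoints: Sequence[Callable[[], str]],
                       attempts_per_endpoint: int = 3) -> str:
    """Retry each endpoint a few times, then fail over to the next one."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:                  # ordered by preference
        for _ in range(attempts_per_endpoint):  # retries cover transient failures
            try:
                return endpoint()
            except ConnectionError as err:
                last_error = err
        # This endpoint failed consistently; move on to the alternative.
    raise RuntimeError("all endpoints failed") from last_error
```

    Passing `[primary_region, secondary_region]` means short glitches in the primary are absorbed by retries, while a prolonged outage shifts traffic to the secondary.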

    Caching

    Caching can be a great performance enhancer and money saver, especially for read-heavy workloads. A service mesh can be configured to cache the results of previous requests and return them instead of bothering the service. It may be even more powerful for serverless functions, where each invocation may carry overhead.

    Again, Kubernetes offers no assistance on this front, although some ingress controllers can provide caching support.
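
    A response cache with a time-to-live can be sketched as follows. The `TTLCache` class and its `fetch` callback are hypothetical, and the sketch ignores details a real proxy must handle, such as cache-control headers and memory limits.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Return a stored response while it is fresh; otherwise call the service."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch    # the call to the real service
        self.ttl = ttl        # how long a response stays fresh, in seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]     # fresh: the service is not bothered
        value = self.fetch(key)
        self.entries[key] = (now, value)
        return value
```

    For a read-heavy endpoint, even a short time-to-live can remove most of the calls that would otherwise reach the service.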

    Metrics

    Metrics are one of the cornerstones of observability. A service mesh is aware of all traffic between services and can collect a lot of useful metrics automatically. Kubernetes provides the metrics server, which collects CPU and memory usage for pods and containers. It is used by the horizontal pod autoscaler and the kubectl top command. You can also record custom metrics, but Kubernetes will not do it for you. A service mesh can be configured to collect request-level metrics.
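
    Request-level metrics are cheap to collect when every call already passes through a proxy. The sketch below shows the idea with a hypothetical `RequestMetrics` class; it assumes the handler returns an HTTP status code.

```python
import time
from collections import Counter
from typing import Callable


class RequestMetrics:
    """Counts requests by status code and accumulates their latency."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.by_status: Counter = Counter()
        self.total_seconds = 0.0

    def observe(self, handler: Callable[[], int]) -> int:
        # Time the call and record its status without changing the result.
        start = self.clock()
        status = handler()
        self.total_seconds += self.clock() - start
        self.by_status[status] += 1
        return status

    def error_rate(self) -> float:
        total = sum(self.by_status.values())
        errors = sum(n for status, n in self.by_status.items() if status >= 500)
        return errors / total if total else 0.0
```

    None of this requires code inside the service, which is why a mesh can offer these numbers for every service at once.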

    Distributed tracing

    Debugging and troubleshooting a distributed system made of many microservices is not easy. A request often travels across multiple services. Distributed tracing (yet another observability cornerstone) lets you track the path of the request across all those services. Kubernetes doesn't have any built-in distributed tracing capability, although multiple projects provide solutions for Kubernetes. A service mesh can integrate with solutions such as Jaeger or OpenZipkin and help you figure out what's wrong when things go south.
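
    At its core, tracing depends on carrying one identifier along with the request from hop to hop. The sketch below shows only that propagation step; the header name `x-trace-id` and the function are assumptions for illustration and are not the header format of Jaeger or OpenZipkin.

```python
import uuid
from typing import Dict

TRACE_HEADER = "x-trace-id"  # hypothetical header name used in this sketch


def with_trace(headers: Dict[str, str]) -> Dict[str, str]:
    """Reuse the incoming trace ID, or start a new trace if there is none."""
    traced = dict(headers)  # copy so the caller's headers are left untouched
    traced.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return traced
```

    If every hop forwards the header it received, all the work done for one user request shares a single ID and can be stitched together later by the tracing backend.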

    Who needs a service mesh?

    OK. Now, we get what a service mesh is. But, do you really need one?

    Yes, you do!

    If you build and manage a large-scale cloud-native application, you want many, if not all, of the capabilities of the service mesh. Let's see why.

    Aspect-oriented programming for the cloud

    When you write a microservice, the actual logic can be very minimal. Your system is composed of a large number of relatively simple components. Even microservices that perform complex computations typically utilize libraries for the heavy lifting. The code for the service itself could be extremely simple, but when you add all the important security, observability and reliability aspects, the code can balloon. All those critical aspects have nothing to do with the functionality of the service itself. They are all orthogonal operational concerns, and they are a burden for the developers of the service. This is reminiscent of aspect-oriented programming, where cross-cutting concerns are kept separate from the core logic.

    The service mesh allows the same benefits, but actually makes it easier because it can be bolted on completely transparently without changes to the application.

    Service mesh vs. the big client library

    Before the age of the service mesh, big client libraries ruled the land and centralized all those operational concerns. Every service had to include those libraries and use them in the same way. Some examples are Hystrix from Netflix (Java) and Finagle from Twitter (Scala targeting the JVM).

    [Diagram: a system where services use a big client library]

    The library approach works, but forces you to make a hard choice - either you limit your microservice implementations to a single programming language or you have to develop and support this important library for multiple programming languages. For large organizations, the single programming language approach is often unacceptable due to existing legacy code or acquisitions.

    The other major problem with the library approach is that when you make changes to the library you must upgrade ALL your services to use the latest version or suffer the consequences of incompatible services. In some cases, like fixing security issues, it's a hard requirement.

    Upgrading all services for large systems is often a serious project and always disruptive to the developers.

    With a service mesh you have no programming language limitations and upgrades can be done mostly transparently by cluster operators without upgrading and redeploying services.

    Service mesh vs. serverless computing

    Serverless is the new buzzword. There are two types of serverless:

    1. You don't have to manage your servers or your nodes in the case of Kubernetes

    2. Function as a service (a.k.a. FaaS)

    The first type is implemented on Kubernetes by supporting cluster autoscaling. If your cluster needs more nodes, they are added to the cluster automatically. Since services and pods on Kubernetes normally don't care which node they run on, a service mesh works pretty much the same way with both sidecar containers and node agents (deployed as a DaemonSet).

    The second type, function as a service, is a little more nuanced. There are many implementations of FaaS on Kubernetes. They are implemented in different ways and the details matter for a service mesh. Some of the most common implementations, like Kubeless and Fission, already integrate with the Istio service mesh.

    The bottom line is that on Kubernetes there isn't too much of a difference between services and serverless functions as a service. Services are best for long-running processes and serverless functions are better suited for event-driven invocations. Both can benefit from a service mesh.

    Quick review of service meshes

    Let's do a quick review of the field. There are many service meshes for Kubernetes out there, with interesting relationships between them.

    Envoy

    Envoy is a very versatile and high-performance L7 proxy developed by Lyft. It provides many service mesh capabilities, but is considered difficult to configure. Many other service meshes for Kubernetes are built on top of Envoy. The Envoy project itself recommends using other open source projects like Ambassador and Gloo as an Ingress controller and/or API gateway on Kubernetes.

    Istio

    Istio is arguably the most popular service mesh on Kubernetes. It is built on top of Envoy and provides a Kubernetes-friendly (YAML manifests) way to configure it. Istio was started by Google, IBM and Lyft. It is super easy (one click) to install on Google GKE and it captured a lot of mindshare.

    Linkerd 2

    Linkerd 2 is a service mesh developed by Buoyant. Buoyant coined the term service mesh and introduced it to the world a few years ago. They initially developed Linkerd as a Scala-based service mesh for multiple platforms including Kubernetes. But they decided to develop a better and faster product more suitable for Kubernetes. That's where Linkerd 2 comes in. The data plane (proxy layer) of Linkerd 2 is implemented in Rust and the control plane in Go. It is one of the rare few service meshes that don't rely on Envoy.

    Kuma

    Kuma is a service mesh developed by Kong. It is also built on top of Envoy. According to the Kuma team it is simpler than Istio on Kubernetes. It can also work in other environments besides Kubernetes.

    Maesh

    Maesh is an interesting service mesh from the creators of Traefik. It uses the node agent approach. It draws its capabilities from Traefik middleware and you can configure it by using annotations.

    AWS App Mesh

    App Mesh is a dedicated service mesh for AWS. It supports EC2, Fargate, ECS, EKS and plain Kubernetes. It is also built on top of Envoy and may be a good option if you want a service mesh that is deeply integrated with AWS services. It lags behind Istio in features and maturity.

    Network Service Mesh

    All the service meshes we discussed so far operate at L4 (TCP, UDP) or L7 (HTTP, HTTP/2) of the network stack. The Network Service Mesh is quite different. It operates at the L2/L3 level and is designed to bring advanced networking capabilities to Kubernetes:

    - Heterogeneous network configurations

    - Tunneling and networking context as first-class citizens

    - Policy-driven service function chaining

    - On-demand, dynamic, negotiated connections

    - Exotic protocols

    Service mesh alternatives

    A service mesh is super useful, but if you don't use its capabilities it might just introduce an extra layer of indirection and complexity. If your use case is more lightweight, you may get all you need from a decent API gateway or a sophisticated ingress controller. Some options are:

    - Traefik

    - Gloo

    - Ambassador

    - Contour

    - Knative

    Conclusion

    Service meshes are an exciting technology. They provide real benefits for complicated distributed systems. Kubernetes provides a solid container orchestration platform and leaves many opportunities for a service mesh to provide added value. In the future, I believe that service meshes will become a staple of well-architected distributed systems.

    What you should do now
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Schedule a demo with Squadcast to learn about the platform, answer your questions, and evaluate if Squadcast is the right fit for you.
    • Curious about how Squadcast can assist you in implementing SRE best practices? Discover the platform's capabilities through our Interactive Demo.
    • Enjoyed the article? Explore further insights on the best SRE practices.
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    • Get a walkthrough of our platform through this Interactive Demo and see how it can solve your specific challenges.
    • See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management
    • Share this blog post with someone you think will find it useful. Share it on Facebook, Twitter, LinkedIn or Reddit
    What you should do now?
    Here are 3 ways you can continue your journey to learn more about Unified Incident Management
    Discover the platform's capabilities through our Interactive Demo.
    See how Charter Leveraged Squadcast to Drive Client Success With Robust Incident Management.
    Share the article
    Share this blog post on Facebook, Twitter, Reddit or LinkedIn.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare our plans and find the perfect fit for your business.
    See Redis' Journey to Efficient Incident Management through alert noise reduction With Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Compare Squadcast & PagerDuty / Opsgenie
    Compare and see if Squadcast is the right fit for your needs.
    Compare our plans and find the perfect fit for your business.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    Discover the platform's capabilities through our Interactive Demo.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Learn how Scoro created a solid foundation for better on-call practices with Squadcast.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Discover the platform's capabilities through our Interactive Demo.
    Enjoyed the article? Explore further insights on the best SRE practices.
    We’ll show you how Squadcast works and help you figure out if Squadcast is the right fit for you.
    Experience the benefits of Squadcast's Incident Management and On-Call solutions firsthand.
    Enjoyed the article? Explore further insights on the best SRE practices.
    Written By:
    November 28, 2019
    November 28, 2019
    Share this post:
    Subscribe to our LinkedIn Newsletter to receive more educational content
    Subscribe now
    ant-design-linkedIN

    Subscribe to our latest updates

    Enter your Email Id
    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.
    FAQs
    More from
    Gigi Sayfan
    Understanding the landscape of AWS compute
    Understanding the landscape of AWS compute
    July 10, 2020
    SLOs for AWS-based infrastructure
    SLOs for AWS-based infrastructure
    July 8, 2020
    Kubernetes Operators for Automated SRE
    Kubernetes Operators for Automated SRE
    May 27, 2020
    Learn how organizations are using Squadcast
    to maintain and improve upon their Reliability metrics
    Learn how organizations are using Squadcast to maintain and improve upon their Reliability metrics
    mapgears
    "Mapgears simplified their complex On-call Alerting process with Squadcast.
    Squadcast has helped us aggregate alerts coming in from hundreds...
    bibam
    "Bibam found their best PagerDuty alternative in Squadcast.
    By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
    tanner
    "Squadcast helped Tanner gain system insights and boost team productivity.
    Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability...
    Alexandre Lessard
    System Analyst
    Martin do Santos
    Platform and Architecture Tech Lead
    Sandro Franchi
    CTO
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2022 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Mid-Market Asia Pacific Incident Management on G2 Users love Squadcast on G2
    Squadcast awarded as "Best Software" in the IT Management category by G2 🎉 Read full report here.
    What our
    customers
    have to say
    mapgears
    "Mapgears simplified their complex On-call Alerting process with Squadcast.
    Squadcast has helped us aggregate alerts coming in from hundreds of services into one single platform. We no longer have hundreds of...
    Alexandre Lessard
    System Analyst
    bibam
    "Bibam found their best PagerDuty alternative in Squadcast.
    By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
    Martin do Santos
    Platform and Architecture Tech Lead
    tanner
    "Squadcast helped Tanner gain system insights and boost team productivity.
    Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability metrics we have...
    Sandro Franchi
    CTO
    Revamp your Incident Response.
    Peak Reliability
    Easier, Faster, More Automated with SRE.
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2 Users love Squadcast on G2
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2
    Best IT Management Products 2024 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Enterprise Incident Management on G2
    Users love Squadcast on G2
    Copyright © Squadcast Inc. 2017-2024
    Blog
    SRE
    The Age of Service Mesh

    The Age of Service Mesh

    Gigi Sayfan
    Gigi Sayfan
    November 28, 2019
    The Age of Service Mesh

    Overview

    You have built a massively successful system. The users just can't get enough and request new features. Your developers crank out new services on a regular basis. Your DevOps/SRE team configures and scale your Kubernetes cluster (or clusters). As the system becomes more complicated and sophisticated you realize that there are common themes that repeat across all your services:

    - Advanced load balancing

    - Service discovery

    - Support canary deployments

    - Caching

    - Tracing a request across multiple microservices

    - Authentication between services

    - Limiting the number of requests a service handles at a given time

    - Automatically retrying failed requests

    - Failing over to alternative component when a component fails consistently

    - Collecting metrics on traffic

    You quickly realize that all these concerns are shared by all your services. Kubernetes helps with some of these concerns like service discovery and load balancing, but you often need more powerful support. You definitely don't want to implement them in each service separately. The traditional way of addressing these issues is to write a big library that all services use. This is a reasonable approach, but there is a better way - the service mesh.

    In this article we will explain what is a service mesh, why it is such an important trend, how service mesh works with Kubernetes and then we'll review some amongst the plethora of existing service meshes and discuss their relative pros and cons.

    Let's get going...

    What's a service mesh?

    Service mesh is an architectural pattern for large-scale cloud native applications that are composed of many microservices. A lot of stuff happens between services. The service mesh externalizes all these concerns outside your application services and manages them centrally using proxies that intercept all traffic between services. Then you configure the service mesh to perform all the cool stuff on your behalf such traffic shaping, security and observability.

    Here is what a service mesh look like:

    Note that service meshes are not unique to Kubernetes. Here we focus on Kubernetes, but many of the concepts translate to other systems with a large number of interacting components.

    Proxies

    What's a proxy? A proxy is a component that sits in front of a service. When other services talk to your service they go through the proxy that can do various things like just pass through the request, send it somewhere else, reject it or even modify it. This is similar in spirit to Kubernetes admission controllers.

    There are two primary ways to deploy a service mesh into your Kubernetes cluster.

    Sidecar containers

    The sidecar container approach injects a proxy container into every pod.

    Here is a diagram that shows a service mesh that use sidecar containers:

    Some of the attributes of sidecar containers are:

    - No need to deploy an agent on each node

    - Ability to deploy different pods with different sidecars (or versions) on the same node

    - Each pod has its own copy of the proxy

    Are those attributes pros or cons? that depends on context. For example, as an administrator you may prefer to be oblivious to the service mesh or alternatively you may want to control exactly what's going on at the network management level.

    Node agents

    The node agent approach installs a single agent on each node that intercepts the traffic and performs the routing and other service mesh functions.

    Here is a diagram that shows a service mesh that uses node agents:

    Some of the attributes of the node agent proxies are:

    - More universal (doesn't require Kubernetes)

    - More control over the service mesh proxies

    - More efficient (no need for deploying a proxy per pod)

    - Requires separate installation and maintenance

    Data plane vs. control plane

    When thinking about service mesh there are two separate aspects - the data plane and the control plane. The data plane is the set of proxies that connect your services (either as sidecar containers or node agents). The control plane as the name suggests controls the proxies that comprise the data plane. It is often a set of APIs and tools to configure policies, collect metrics and get aggregated view of your service mesh.

    Service mesh on Kubernetes

    Let's review some of the benefits that a service mesh can bring to your Kubernetes cluster!

    Advanced load balancing

    Kubernetes services provide a basic form of load balancing where a pool of backing pods serve requests coming into the service. You can even implement using services and labels simple canary deployments. If you want 10% of your requests to go to version 2 of a service you can deploy 9 pods with version 1 and one pod with version 2. But, with a service mesh you can do much more advanced load balancing that operates at the request level and not at the pod level. You can also do load balancing based on request path and parameters or use different algorithms like least number of connections for super fine-grained control.

    Authentication and authorization

    Authentication between services is important for security in depth. Kubernetes provides strong authentication and authorization around access to cluster resources and network policies, but a service mesh can take it to the next level with automatic mutual TLS and custom authorization.

    Circuit breaking

    Circuit breaking takes an unresponsive instance out of circulation. That helps prevent long delays by constantly retrying to reach an overloaded or dead pod. Kubernetes has some decent support for unresponsive pods with health checks and readiness probes. But, if the problem is misconfiguration or problem within the service itself Kubernetes can't help much. A service mesh operates at a higher level of abstraction and can do circuit breaking basked on the results of requests.

    Rate limiting

    Rate limiting is important to protect against denial of service attacks where attackers bombard your system with lots of requests, hoping to bring it down to its knees via resource exhaustion. It also helps to avoid paying enormous bills if you misconfigure your system or a load test goes out of control. Another use case is to prevent cascading failures where excessive load on one service propagates to lots of internal services.

    A service mesh lets you define and control those limits centrally and without impacting the services themselves.

    Kubernetes doesn't provide any built-in help here.

    Retries and failovers

    Building distributed systems is all about building a reliable system out of unreliable components. In a large microservice-based distributed system some services may be unreachable temporarily due to networking issues, maintenance or upgrades.

    A service mesh can be configured to automatically retry failed requests. Retries address temporary intermittent failures in a smooth and streamlined way. However, if a service is down or unreachable for a prolonged period of time it is often best to fail over to an alternative location (e.g. in another region).

    Kubernetes automatically restarts failed containers, and replica sets/deployments ensure that enough pods are always running. But as long as the pods and containers are running, it will not retry requests or fail over in case of consistent failures.
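
    Here is a minimal sketch of retry with exponential backoff followed by failover. The function name, its parameters and the use of ConnectionError as the "retryable" failure are assumptions for illustration only.

```python
import time


def call_with_retries(endpoints, send, attempts_per_endpoint=3,
                      base_delay=0.1, sleep=time.sleep):
    """Try each endpoint in order; retry with backoff before failing over."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return send(endpoint)
            except ConnectionError as error:
                last_error = error
                # Back off between attempts, but not before failing over.
                if attempt < attempts_per_endpoint - 1:
                    sleep(base_delay * 2 ** attempt)
    raise last_error
```

    The list of endpoints is ordered by preference, for example the local region first and a remote region second.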

    Caching

    Caching can be a great performance enhancer and money saver, especially for read-heavy workloads. A service mesh can be configured to cache the results of previous requests and return them instead of bothering the service. It may be even more powerful for serverless functions where each invocation may carry overhead.

    Again, Kubernetes itself offers no assistance on this front, although some ingress controllers can provide caching support.
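
    As a rough sketch of the idea, the class below caches responses for a fixed time-to-live. The names and the TTL policy are assumptions; real proxies also take cache headers and invalidation into account.

```python
import time


class TTLCache:
    """Returns a stored response until it is older than `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # cache hit: the service is not called
        value = fetch(key)  # cache miss or expired entry
        self.entries[key] = (now + self.ttl, value)
        return value
```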

    Metrics

    Metrics are one of the cornerstones of observability. Kubernetes provides the metrics server that collects CPU and memory usage for pods and containers. It is used by the horizontal pod autoscaler and the kubectl top command. You can also record custom metrics, but Kubernetes will not do it for you. A service mesh, in contrast, is aware of all traffic between services and can be configured to collect a lot of useful request-level metrics automatically.
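
    To illustrate what "request-level" means, here is a small sketch of the bookkeeping a proxy could do for every request it forwards. The class and method names are assumptions, and treating status codes of 500 and above as errors is a simplification.

```python
from collections import defaultdict


class RequestMetrics:
    """Per-service request count, error count and total latency."""

    def __init__(self):
        self.count = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def observe(self, service, status, seconds):
        self.count[service] += 1
        self.total_seconds[service] += seconds
        if status >= 500:
            self.errors[service] += 1

    def error_rate(self, service):
        total = self.count.get(service, 0)
        return self.errors.get(service, 0) / total if total else 0.0

    def mean_latency(self, service):
        total = self.count.get(service, 0)
        return self.total_seconds.get(service, 0.0) / total if total else 0.0
```

    None of this requires cooperation from the service, because the proxy sees every request and response on the wire.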

    Distributed tracing

    Debugging and troubleshooting a distributed system made of many microservices is not easy. A request often travels across multiple services. Distributed tracing (yet another observability cornerstone) lets you track the path of the request across all those services. Kubernetes doesn't have any built-in distributed tracing capability, although multiple projects provide solutions for Kubernetes. A service mesh can integrate with solutions like Jaeger or OpenZipkin and help you figure out what's wrong when things go south.
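
    The core mechanism is propagating a trace identifier from hop to hop. The sketch below shows the principle only; the header names are hypothetical and do not match the headers of any specific tracing system.

```python
import uuid

# Hypothetical header names, chosen for illustration only.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


def start_span(incoming_headers):
    """Join the caller's trace if there is one, otherwise start a new trace."""
    return {
        "trace_id": incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": incoming_headers.get(PARENT_HEADER),
    }


def outgoing_headers(span):
    """Headers to attach to requests this span makes to other services."""
    return {TRACE_HEADER: span["trace_id"], PARENT_HEADER: span["span_id"]}
```

    Every span in one request shares the same trace ID, and the parent IDs let a tracing backend reconstruct the call tree.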

    Who needs a service mesh?

    OK. Now, we get what a service mesh is. But, do you really need one?

    Yes, you do!

    If you build and manage a large-scale cloud-native application you want many if not all the capabilities of the service mesh. Let's see why.

    Aspect-oriented programming for the cloud

    When you write a microservice the actual logic can be very minimal. Your system is comprised of a large number of relatively simple components. Even microservices that perform complex computations typically utilize libraries for the heavy lifting. The code for the service itself could be extremely simple, but when you add all the important security, observability and reliability aspects the code can balloon. All those critical aspects have nothing to do with the functionality of the service itself. They are all orthogonal operational concerns. They are a burden for the developers of the service. This is reminiscent of aspect-oriented programming.

    The service mesh offers the same benefits, but actually makes it easier because it can be bolted on completely transparently without changes to the application.
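
    For readers who haven't met aspect-oriented programming, a Python decorator gives the flavor: a cross-cutting concern (here, recording the outcome of each call) is attached to business logic without touching its body. The function names are made up for this example.

```python
import functools


def with_call_log(log):
    """Wrap a function so that every call outcome is appended to `log`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                log.append((func.__name__, "error"))
                raise
            log.append((func.__name__, "ok"))
            return result
        return wrapper
    return decorator


audit_log = []


@with_call_log(audit_log)
def price_with_tax(amount):
    # Pure business logic: it knows nothing about logging.
    return round(amount * 1.2, 2)
```

    The decorator still lives in the application code and is tied to one language. The sidecar proxy plays the same role from outside the process.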

    Service mesh vs. the big client library

    Before the age of the service mesh, big client libraries ruled the land and centralized all those operational concerns. Every service had to include those libraries and use them in the same way. Some examples are Hystrix from Netflix (Java) and Finagle from Twitter (Scala targeting the JVM).

    Here is what a system where services use a big client library looks like:

    The library approach works, but forces you to make a hard choice - either you limit your microservice implementations to a single programming language or you have to develop and support this important library for multiple programming languages. For large organizations, the single programming language approach is often unacceptable due to existing legacy code or acquisitions.

    The other major problem with the library approach is that when you make changes to the library you must upgrade ALL your services to use the latest version or suffer the consequences of incompatible services. In some cases like fixing security issues it's a hard requirement.

    Upgrading all services for large systems is often a serious project and always disruptive to the developers.

    With a service mesh you have no programming language limitations and upgrades can be done mostly transparently by cluster operators without upgrading and redeploying services.

    Service mesh vs. serverless computing

    Serverless is the new buzzword. There are two types of serverless:

    1. You don't have to manage your servers or your nodes in the case of Kubernetes

    2. Function as a service (a.k.a. FaaS)

    The first type is implemented on Kubernetes by supporting cluster autoscaling. If your cluster needs more nodes they are added to the cluster automatically. Since services and pods on Kubernetes normally don't care which node they run on, a service mesh works pretty much the same. This holds for both sidecar containers and node agents (deployed as a DaemonSet).

    The second type of function as a service is a little more nuanced. There are many implementations of FaaS on Kubernetes. They are implemented in different ways and the details matter for service mesh. Some of the most common implementations like Kubeless and Fission already integrate with the Istio service mesh.

    The bottom line is that on Kubernetes there isn't too much of a difference between services and serverless functions as a service. Services are best for long-running processes and serverless functions are better suited for event-driven invocations. Both can benefit from a service mesh.

    Quick review of service meshes

    Let's do a quick review of the field. There are many service meshes for Kubernetes out there with interesting relationships between them.

    Envoy

    Envoy is a very versatile and high-performance L7 proxy developed by Lyft. It provides many service mesh capabilities, but is considered difficult to configure. Many other service meshes for Kubernetes are built on top of Envoy. The Envoy project itself recommends using other open source projects like Ambassador and Gloo as an Ingress controller and/or API gateway on Kubernetes.

    Istio

    Istio is arguably the most popular service mesh on Kubernetes. It is built on top of Envoy and provides a Kubernetes-friendly (YAML manifests) way to configure it. Istio was started by Google, IBM and Lyft. It is super easy (one click) to install on Google GKE and it captured a lot of mindshare.

    Linkerd 2

    Linkerd 2 is a service mesh developed by Buoyant. Buoyant coined the term service mesh and introduced it to the world a few years ago. They initially developed Linkerd as a Scala-based service mesh for multiple platforms including Kubernetes. But, they decided to develop a better and faster product more suitable for Kubernetes. That's where Linkerd 2 comes in. The data plane (proxy layer) of Linkerd 2 is implemented in Rust and the control plane in Go. It is one of a rare few service meshes that don't rely on Envoy.

    Kuma

    Kuma is a service mesh developed by Kong. It is also built on top of Envoy. According to the Kuma team it is simpler than Istio on Kubernetes. It can also work in other environments besides Kubernetes.

    Maesh

    Maesh is an interesting service mesh from the creators of Traefik. It uses the node agent approach. It draws its capabilities from Traefik middleware and you can configure it by using annotations.

    AWS App Mesh

    App Mesh is a dedicated service mesh for AWS. It supports EC2, Fargate, ECS, EKS and plain Kubernetes. It is also built on top of Envoy and may be a good option if you want a service mesh that is deeply integrated with AWS services. It lags behind Istio as far as features and maturity.

    Network Service Mesh

    All the service meshes we discussed so far operate at the L4 (TCP, UDP) or L7 (HTTP, HTTP/2) layers of the network stack. The Network Service Mesh is quite different. It operates at the L2/L3 level and is designed to bring advanced networking capabilities to Kubernetes:

    - Heterogeneous network configurations

    - Tunneling and networking context as first-class citizens

    - Policy-driven service function chaining

    - On-demand, dynamic, negotiated connections

    - Exotic protocols

    Service mesh alternatives

    A service mesh is super useful, but if you don't use its capabilities it might just introduce an extra layer of indirection and complexity. If your use case is more lightweight, you may get all you need from a decent API gateway or a sophisticated ingress controller. Some options are:

    - Traefik

    - Gloo

    - Ambassador

    - Contour

    - Knative

    Conclusion

    Service meshes are an exciting technology. They provide real benefits for complicated distributed systems. Kubernetes provides a solid container orchestration platform and leaves many opportunities for a service mesh to provide added value. In the future, I believe that service meshes will become a staple of well-architected distributed systems.

    Plug: Keep your K8s clusters reliable with Squadcast

    Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

    Written By:
    Gigi Sayfan
    November 28, 2019
    SRE
    Observability
    Kubernetes