Kubernetes Monitoring Best Practices

July 2, 2023
Share this post:
Kubernetes Monitoring Best Practices
Table of Contents:

    Kubernetes can be installed using different tools, whether open-source, third-party vendor, or in a public cloud. In most cases, default installations have limited monitoring capabilities. Therefore, once a Kubernetes cluster is running, administrators must implement monitoring solutions to meet their requirements.

    Typical use cases for Kubernetes monitoring include:  

    • Ensuring workload reliability
    • Achieving high-level visibility into your workload
    • Alerting and enabling incident management

    Effective Kubernetes monitoring requires a mix of tools, strategy, and technical expertise. To help you get it right, this article will explore seven essential Kubernetes monitoring best practices in detail.

    Summary of key Kubernetes monitoring best practices concepts

    The table below summarizes the Kubernetes monitoring best practices we will explore in this article.

    Concept Description
    Monitoring vs. observability Observability means gaining insights into your workload’s performance using external indicators. Monitoring means checking such indicators over time.
    Determining requirements Accurately determine your requirements and monitoring goals.
    Identify appropriate metrics Identify which metrics you need to achieve your monitoring goals.
    Select the right tools Selecting the right tools given your requirements is a critical best practice. A main decision here is whether to build something in-house using open-source software or buy a more complete SaaS solution with better support.
    Monitor the monitoring system In a production workload, it is important to monitor the monitoring system itself to ensure it is reliable and highly available.
    Consider data storage Monitoring data must be stored and managed effectively.
    Monitor the control plane Monitoring the Kubernetes control plane is easy to overlook, so teams should be intentional about control plane monitoring.
    Account for incident response Monitoring outputs can enhance incident response coordination, which can reduce MTTR (mean time to resolve).

    Monitoring vs. observability

    Before we go into more detail, let’s unpack an often confusing topic, monitoring vs. observability. The term “monitoring” is more traditional and covers the collection of metrics and logs used to monitor the application infrastructure components. The idea is to “monitor” a workload by constantly evaluating the real-time performance of its underlying infrastructure.

    Observability is a relatively new concept, and even though it overlaps with monitoring, its end goal is to isolate a performance bottleneck along a transaction path instead of monitoring the application infrastructure. Observability gained traction in application environments designed based on the paradigm of microservices, where an application comprises modularized services hosted in ephemeral containers and interacting with each other via application programming interfaces (API). In such an environment, monitoring the servers and containers in isolation isn’t meaningful, so a new perspective was required, giving rise to the notion of observability.  

    In addition to metrics and logs, observability also includes distributed tracing to follow the path of a transaction through the application infrastructure. Distributed tracing enables operation engineers to understand the path a user’s request takes, including:

    • When the workload received the request,
    • The stages or microservices it went through, and
    • When the response was sent to the user.

    Observability allows the operations engineers to quickly understand the upstream and downstream impact of application services on each other. Typically, observability tools will combine metrics, logs, and tracing to give engineers a coherent view of the entire transaction path across the infrastructure. Read this article if you want to learn more about observability (also called “O11y”).

    7 essential Kubernetes monitoring best practices

    The seven Kubernetes monitoring best practices below can help DevOps and SRE (site reliability engineering) teams achieve SLOs (service level objectives) and improve overall infrastructure observability.

    Kubernetes monitoring best practice #1: Determine what you want to achieve

    Determining business goals is the first (and arguably the most important) Kubernetes monitoring best practice. Examples of such goals are:

    • Gain visibility into your cluster’s health
    • Gain visibility into the end user’s experience
    • Be alerted when certain events occur
    • Anticipate potential problems
    • Identify trends and patterns in the workload utilization, such as a steady increase in disk utilization which will lead to the disk being full within a certain period of time
    • Identify trends and patterns going out of the ordinary or out of what’s expected
    • Scale pods in and out when certain conditions are met
    • Evaluate the reliability of the application against expected criteria

    While planning is important, it’s also essential not to overthink it. Teams just getting started with monitoring should avoid analysis paralysis and instead take an iterative approach to developing a plan. Additional requirements can be added later to address new information and requirements.

    Integrated full stack reliability management platform
    Try for free
    Drive better business outcomes with incident analytics, reliability insights, SLO tracking, and error budgets
    Manage incidents on the go with native iOS and Android mobile apps
    Seamlessly integrated alert routing, on-call, and incident response
    Try for free

    Kubernetes monitoring best practice #2: Identify metrics to monitor

    Once you’ve identified your business goals, you can identify which metrics you need to collect to achieve those goals. This step also includes defining related configuration parameters, such as the collection rate and how long you need to store the metrics data.

    Some metrics are usually readily available, typically system metrics. These metrics include:

    • CPU utilization
    • Memory utilization
    • Free space available on disks
    • Disk input/output data
    • Network usage

    System metrics are usually necessary as part of any monitoring strategy and tend to show the overall stress the cluster is under. However, they are quite basic and usually won’t give enough actionable information beyond telling whether the cluster seems healthy.

    Additionally, more complex metrics are often required. These metrics are often tied to the software you run. For example, they could measure:

    • How responsive is the website or the app?
    • How many users are currently logged in?
    • What is the average number of concurrent users at 10am on weekdays?
    • How fast does your support team respond to the initial requests?
    • What’s the rate of 5xx errors reported by your web server?
    • What’s the average number of jobs in an input queue per day?
    Kubernetes metrics dashboard
    Kubernetes metrics dashboard (source)

    Kubernetes monitoring best practice #3: Select the right tools

    Our next Kubernetes monitoring best practice is selecting the right tools based on the required metrics and achieving your monitoring goals.

    Free and Open Source Software (FOSS) vs. commercial third-party software is commonly used to categorize Kubernetes monitoring tools. Some examples of FOSS monitoring solutions include:

    • Tools to collect metrics (e.g.,  Prometheus, kube-metrics)
    • Tools to collect logs (e.g., Loki, Fluentd)
    • Tools to collect traces (e.g., Jaeger)
    • Tools for visualization and alerting (e.g., Grafana, Alertmanager)

    While plenty of open-source options are available, you will need in-house expertise and a significant amount of DevOps engineers’ time to build and maintain a FOSS monitoring solution. If you don’t have in-house experts, you can hire consultants to build a solution, but this will likely be expensive. On the other hand, developing your own monitoring solution could save you a considerable amount of money in the long-run.

    The alternative is to pay for third-party software, which usually offers turn-key, software-as-a-service (SaaS) solutions. Commercial options typically have more advanced products, such as machine learning to detect suspicious trends and patterns or perform offline data analysis. Additionally, most commercial solutions come with a level of support that FOSS projects lack.

    When evaluating solutions, remember that using third-party tools (especially SaaS products) can create compliance issues, such as safeguarding personally identifiable information under HIPAA or GDPR. You might also need to open your cluster to allow routes from the internet for the third-party SaaS products, which increases attack surface and could create other security issues.

    Kubernetes monitoring best practice #4: Monitor your monitoring system

    Unless you run a non-production workload, you probably want every element of your monitoring solution to be highly available and scalable. Achieving high-availability monitoring requires monitoring of the monitoring system itself. At a minimum, you must be able to detect critical failures in your monitoring system and send notifications when they occur. Ideally, you should also configure automated remediation of such problems.

    Generally speaking, this extra level of monitoring is required only for in-house solutions, as third-party SaaS vendors usually have monitoring systems for their platforms. Some FOSS products incorporate their own monitoring systems. For example, Loki comes with Loki Canary, which regularly sends dummy logs to Loki and reads them back to ensure it works fine.

    Kubernetes monitoring best practice #5: Consider data storage

    Your monitoring system will accumulate data over time, and this data should be managed like any other data. You will need to determine how long you need to hold onto it, maybe even put it in cold storage after a while. Be sure to consider any regulations or legal requirements applicable to your organization so that the data can be accessed and provided quickly if requested. Determining your data retention requirements for your monitoring data will be part of your overall requirement gathering exercise, and you will then need to implement it accordingly.

    Kubernetes monitoring best practice #6: Monitor the control plane

    Do not neglect monitoring your control plane as well! All the best practices we have listed also apply to the control plane, not just the data plane. Some Kubernetes managed solutions, such as Amazon’s EKS, will do that automatically for you. If not, you will need to add the monitoring of the control plane nodes and the various control plane components into your monitoring strategy.

    Kubernetes monitoring best practice #7: Account for incident response

    Once your monitoring system is up and running and able to send alerts to your team, you must consider how to respond to such alerts. Squadcast can help coordinate incident responses, ensuring a very high level of coordination within your team so they can be as efficient as possible while dealing with the problem.

    Integrating monitoring data into a robust incident response strategy helps teams detect and recover from outages and other production-disrupting incidents faster. As a result, MTTR decreases, and uptime improves.

    Conclusion

    Monitoring your production workloads is necessary, but working towards true observability across your Kubernetes infrastructure is important. If you’re just starting, the most important of our Kubernetes monitoring best practices is gathering your requirements and defining your business goals.

    Once you have requirements and goals, identify which metrics will help fulfill them before you move on to tooling. Selecting the right tools is an important step, especially choosing between FOSS (one consequence being that your team will have to spend time and effort to implement an in-house monitoring solution) and a paid-for third-party solution (which are usually more exhaustive and come with better support). Compliance and security are other considerations you might need to consider when choosing your tools, depending on your project’s requirements.

    Finally, especially after building an in-house solution, ensure your monitoring system is reliable, which would require monitoring it. And don’t forget that Squadcast can help with the coordination of incident responses within your team.

    Integrated full stack reliability management platform
    Platform
    Blameless
    Lightstep
    Squadcast
    Incident Retrospectives
    Seamless Third-Party Integrations
    Built-In Status Page
    On Call Rotations
    Incident
    Notes
    Advanced Error Budget Tracking
    Try For free
    Platform
    Incident Retrospectives
    Seamless Third-Party Integrations
    Incident
    Notes
    Built-In Status Page
    On Call Rotations
    Advanced Error Budget Tracking
    Blameless
    FireHydrant
    Squadcast
    Try For free
    Share this post:
    Subscribe to our LinkedIn Newsletter to receive more educational content
    Subscribe now

    Subscribe to our latest updates

    Enter your Email Id
    Thank you! Your submission has been received!
    Oops! Something went wrong while submitting the form.
    FAQ
    More from
    Squadcast Community
    Helm Dry Run: Guide & Best Practices
    Helm Dry Run: Guide & Best Practices
    August 27, 2023
    Azure Monitoring Agent: Key Features & Benefits
    Azure Monitoring Agent: Key Features & Benefits
    August 13, 2023
    Introducing Squadcast’s Key Based Deduplication
    Introducing Squadcast’s Key Based Deduplication
    August 1, 2023
    Learn how organizations are using Squadcast
    to maintain and improve upon their Reliability metrics
    Learn how organizations are using Squadcast to maintain and improve upon their Reliability metrics
    mapgears
    "Mapgears simplified their complex On-call Alerting process with Squadcast.
    Squadcast has helped us aggregate alerts coming in from hundreds...
    bibam
    "Bibam found their best PagerDuty alternative in Squadcast.
    By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
    tanner
    "Squadcast helped Tanner gain system insights and boost team productivity.
    Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability...
    Alexandre Lessard
    System Analyst
    Martin do Santos
    Platform and Architecture Tech Lead
    Sandro Franchi
    CTO
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2022 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Mid-Market Asia Pacific Incident Management on G2 Users love Squadcast on G2
    Squadcast awarded as "Best Software" in the IT Management category by G2 🎉 Read full report here.
    What our
    customers
    have to say
    mapgears
    "Mapgears simplified their complex On-call Alerting process with Squadcast.
    Squadcast has helped us aggregate alerts coming in from hundreds of services into one single platform. We no longer have hundreds of...
    Alexandre Lessard
    System Analyst
    bibam
    "Bibam found their best PagerDuty alternative in Squadcast.
    By moving to Squadcast from Pagerduty, we have seen a serious reduction in alert fatigue, allowing us to focus...
    Martin do Santos
    Platform and Architecture Tech Lead
    tanner
    "Squadcast helped Tanner gain system insights and boost team productivity.
    Squadcast has integrated seamlessly into our DevOps and on-call team's workflows. Thanks to their reliability metrics we have...
    Sandro Franchi
    CTO
    Revamp your Incident Response.
    Peak Reliability
    Easier, Faster, More Automated with SRE.
    Incident Response Mobility
    Manage incidents on the go with Squadcast mobile app for Android and iOS devices
    google playapple store
    google playapple store
    Squadcast - On-call shouldn't suck. Incident response for SRE/DevOps, IT | Product Hunt Embed
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2 Best IT Management Products 2022 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Mid-Market Asia Pacific Incident Management on G2 Users love Squadcast on G2
    Squadcast is a leader in Incident Management on G2 Squadcast is a leader in Mid-Market IT Service Management (ITSM) Tools on G2 Squadcast is a leader in Americas IT Alerting on G2
    Best IT Management Products 2022 Squadcast is a leader in Europe IT Alerting on G2 Squadcast is a leader in Mid-Market Asia Pacific Incident Management on G2
    Users love Squadcast on G2
    Copyright © Squadcast Inc. 2017-2023