O11y: Tutorial, Best Practices & Examples

Observability, commonly abbreviated as o11y, is a composite strategy for inspecting a service’s performance, availability, and quality, as well as how the service affects other system components.

So is observability different from the practice of simply monitoring systems and applications? We must answer this question in the context of the history of application architecture. Even though some practitioners use the terms interchangeably, o11y has the connotation of being a modern approach to monitoring.

In the days when physical servers hosted monolithic applications, client-server architectures were monitored primarily through system metrics such as CPU and memory. As application architecture transitioned from the client-server model to one based on microservices, the priority shifted from monitoring systems (CPU and memory) to monitoring services (latency and error rate). While this transition unfolded, new technologies allowed systems administrators to index and search terabytes of system and application logs distributed throughout their infrastructure and trace a single transaction through the application tiers.

The industry needed a new name to distinguish this new approach of monitoring services using new technologies from the old practice of monitoring infrastructure usage metrics, and that’s how the term o11y was coined. This evolution doesn’t mean monitoring the infrastructure systems is no longer required; it just means that the two paradigms must work in tandem.

This article discusses the three key pillars of observability, the purpose each serves in observing distributed systems, their use cases, and popular open-source tools. We also discuss recommended practices for the efficient administration of an o11y framework.

Key pillars of o11y

Observability relies on key insights that help determine how the system works, why it behaves the way it does, and how changes can be made to improve its performance. These insights collectively help describe inherent issues with system performance and are categorized as the three pillars of observability.

Key pillar	Description	Sample types	Popular open-source tools
Metrics	A metric is a numeric representation of system attributes measured over a period of time. Metrics record time-series data to detect system vulnerabilities, analyze system behavior, and model historical trends for security and performance optimization.	• CPU utilization • Network throughput • Application response time • API latency • Download activity logs from AWS • Mean time between failures	Prometheus
Event Logs	A log is an immutable record of an event that occurs at any point in the request life cycle. Logs enable granular debugging by providing detailed insights about an event, such as configuration failures or resource conflicts among an application’s components.	• Disk space warnings • Memory dumps • Application hangs • Non-existent path errors • Unhandled code exceptions	ELK Stack
Distributed Traces	A trace is a representation of the request journey through different components within the application stack. Traces provide an in-depth look into the program’s data flow and progression, allowing developers to identify bottlenecks for improved performance.	• OS instrumentation traces • Processor/core data traces • Hardware and software system traces • User-level traces	Jaeger

Administering observability with key pillars

The pillars of observability are essential indicators that measure external outputs to record and analyze the internal state of a system. This information can be used to reconstruct the state of the application’s infrastructure with enough detail to analyze and optimize software processes.

Metrics

Metrics are numerical representations of performance and security data measured in specific time intervals. These time-series data representations are derived as quantifiable measurements to help inspect the performance and health of distributed systems at the component level. Site reliability engineers (SREs) commonly harness metrics with predictive modeling to determine the overall state of the system and correlate projections for improving cluster performance. Since numerical attributes are optimized for storage, processing, compression, and retrieval, metrics allow for longer retention of performance data and are considered ideal for historical trend analysis.

Factors to consider before using metrics for observing distributed systems

First, consider the amount of data the system generates and whether all of it can be collected and processed in a timely manner. If it cannot, aggregated metrics can still provide a compact picture of what is going on with the system. Collecting granular data can help resolve issues quickly, but it also adds storage and processing load. To limit that load, capture only the data that will give you valuable insights over time instead of collecting and hoarding data that will never be used.
Another factor to consider is the level of granularity required for the observability metric. For example, if you are trying to detect problems with a specific component of the system, you will need metrics that are specific to that component. On the other hand, if you are trying to get a general sense of how the system is performing, then less specific metrics, such as system metrics for garbage collection or CPU usage, may be sufficient.
Finally, it is also important to assess the duration for which you need data for observability purposes. In some cases, real-time data may be necessary to quickly detect and fix problems. In other cases, it may be sufficient to collect data over longer periods of time to build up trends and identify potential issues. For example, a sudden spike in CPU utilization could indicate that something is wrong right now, whereas a brief spike in memory usage might matter less than how memory usage trends over a longer period.

Limitations of metrics

Metrics are system-scoped, meaning that they can only track what is happening inside a system. For example, an error rate metric can only track the number of errors generated by the system and does not give any insights that may help determine why the errors are occurring. This is why log entries and traces are used to complement metrics and help isolate the root causes of problems.
Using metric labels with high cardinality affects system performance. A typical example is a distributed microservices architecture, which can produce millions of unique time series. In such cases, high cardinality can quickly overwhelm the monitoring system, which has to process and store every one of those series.
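
To make the cardinality limitation concrete, the following minimal sketch estimates the upper bound on the number of time series a single metric can produce, which is the product of the number of distinct values of each label. The label names and counts are hypothetical and chosen only for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on distinct time series for one metric.

    `label_values` maps each label name to its number of distinct values.
    """
    return prod(label_values.values())


# Hypothetical label sets for a request counter.
bounded = {"service": 50, "method": 5, "status_class": 5}
# Adding a per-user label multiplies the series count by the number of users.
unbounded = {**bounded, "user_id": 100_000}

print(series_count(bounded))    # 1250
print(series_count(unbounded))  # 125000000
```

A single unbounded label such as a user ID is usually enough to turn a manageable metric into one that strains storage, which is why such identifiers tend to belong in logs or traces rather than in metric labels.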

Using Prometheus for metrics collection and monitoring

Prometheus is one of the most popular open-source monitoring and alerting toolkits, and it collects metrics as time-series data. The toolkit uses a multi-dimensional data model that identifies each time series by a metric name and a set of key-value pairs called labels. Prometheus also supports multiple modes for graphing and dashboarding, enhancing observability through metrics visualization.

Prometheus is used across a wide range of system complexities and use cases, including OS monitoring, metrics collection for containers, distributed cluster monitoring, and service monitoring.

Key features of Prometheus include the following:

Autonomous single server nodes, eliminating the need for distributed storage
Flexible metrics querying and analysis through the PromQL query language
Deep integration with cloud-native tools for holistic observability setup
Seamless service discovery of distributed services
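
As an illustration of the multi-dimensional data model, the sketch below renders one counter in the Prometheus text exposition format, where each sample is identified by a metric name plus label key-value pairs. It is a hand-rolled stand-in written for this article, not the official client library; the function names, metric name, and label values are assumptions.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render a counter as Prometheus text exposition lines.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        # A sample without labels is written as the bare metric name.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts keyed by label set.
requests = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}
print(render_counter("http_requests_total", "Total HTTP requests.", requests), end="")
```

Once samples like these are scraped, a PromQL query such as `sum by (method) (rate(http_requests_total{status="500"}[5m]))` would return the per-second rate of failed requests for each method over the last five minutes.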

Event logs

A log is an unchangeable, timestamped record of events within a distributed system of multiple components. Logs are records of events, errors, and warnings that are generated by the application throughout its life cycle. These records are captured within plaintext, binary, or JSON files as structured or unstructured data, and they also include contextual information associated with the event (such as a client endpoint) for efficient debugging.

Since failures in large-scale deployments rarely arise from one specific event, logs allow SREs to start with a symptom, infer the request life cycle across the software, and iteratively analyze interactions among different parts of the system. With effective logging, SREs can determine the possible triggers and components involved in a performance/reliability issue.

Factors to consider before using event logs for observing distributed systems

The first consideration is the type of system you are operating: Event logs may be more useful in some systems than others. For example, they may be more helpful in a system with numerous microservices, where it can be otherwise difficult to understand the relationship among events occurring in different services.
Another factor is the type of event you are interested in observing. Some events may be more important than others and thus warrant closer scrutiny via event logs. For example, if you are interested in understanding how users are interacting with your system, then events related to user interaction (such as login attempts) would be more important than other event types.
Finally, you must also consider the scale of your system when deciding when to use event logs for observability. A large and complex system may generate a huge volume of events, making it impractical and financially unviable to log all of them. In such cases, it is important to identify the most critical events and focus on only logging those.

Limitations of logs

Archiving event logs requires extensive investment in storage infrastructure.
Log files only capture information that the logging toolkit has been configured to record.
Without appropriate indexing, log data is difficult to sort, filter, and search.

Using the ELK stack for logging

Also popularly known as the Elastic Stack, the ELK stack enables SREs to aggregate logs from distributed systems and then analyze and visualize them for quicker troubleshooting, analysis, and monitoring. The stack consists of three open-source projects whose initial letters produce the “ELK” name:

Elasticsearch: A NoSQL search and analytics engine that enables the aggregation, sorting, and analysis of logs
Logstash: An ingestion tool that collects and parses log data from disparate sources of a distributed system
Kibana: A data exploration and visualization tool that enables a graphical review of log events

Figure: Elastic Stack components

While each of the tools can be used independently, they are commonly used together to support use cases that require log monitoring for actionable insights. Some popular use cases of the ELK stack include application performance monitoring (APM), business information analytics, security and compliance administration, and log analysis.

Key features of the ELK stack include the following:

A highly available distributed search engine to support near real-time searches
Real-time log data analysis and visualization
Native support for several programming languages and development frameworks
Multiple hosting options
Centralized logging capabilities

Distributed traces

A trace is a representation of a request as it flows through various components of an application stack. Traces record how long it takes a component to process a request and pass it to the next component, subsequently offering insights into the root causes that trigger errors. A trace enables SREs to observe both the structure of a request and the path it travels.

Since traces in distributed systems allow for the complete analysis of a request’s life cycle, they make it possible to debug events spanning multiple components. Traces can be used for a variety of purposes, such as monitoring performance, understanding bottlenecks, finding errors, and diagnosing problems. They can also be used to validate assumptions about the system’s behavior and to generate hypotheses about potential problems.
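
The idea of a trace recording how long each component takes can be sketched with a small span model. The example below is an illustration only: the `Span` fields, service names, and timings are hypothetical, and the self-time calculation assumes that the children of a span do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time spent in each span excluding its direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# One hypothetical request: gateway -> orders -> (inventory, payments).
trace = [
    Span("a", None, "gateway", 0, 120),
    Span("b", "a", "orders", 10, 110),
    Span("c", "b", "inventory", 20, 40),
    Span("d", "b", "payments", 45, 105),
]

times = self_times(trace)
slowest = max(times, key=times.get)
print(times)    # {'a': 20, 'b': 20, 'c': 20, 'd': 60}
print(slowest)  # d
```

Subtracting child durations separates the time a component spends on its own work from the time it spends waiting on downstream calls, which is what points to the payments span here rather than to the gateway that wraps the whole request.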

Factors to consider before using traces for observing distributed systems

The decision of whether or not to use tracing, and if so, how much tracing to do, depends on a variety of factors specific to each system.

Diligently analyze the size and complexity of the system, the number of users, the types of workloads being run, and the frequency with which changes are made to the system. In general, though, it is advisable to use tracing whenever possible, as it can provide invaluable insights into the workings of a distributed system.
While determining the target endpoints is crucial, traces can be expensive to collect and store, so it is important to make sure that the benefits of using them outweigh the costs. Traces can also add significant overhead to a system, so make sure that they do not impact overall performance.
Finally, it is important to consider the privacy implications of collecting and storing traces. In some cases, traces may contain sensitive information about users or systems that should not be publicly accessible.

Limitations of traces

Every component in the request path must be individually configured to propagate trace data.
Configuring instrumentation and sampling adds complexity and overhead.
Tracing is expensive and resource-intensive.
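
The propagation requirement can be illustrated with a minimal sketch in which each service forwards the trace identifier it received and generates a new span identifier for its own work. The header layout used here follows the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags); that choice is an assumption made for illustration, and a given tracing setup may use different headers. The function names are hypothetical.

```python
import secrets


def new_traceparent():
    """Start a new trace at the edge of the system (sampled flag set)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming):
    """Build the header a service sends downstream.

    The trace ID and flags are preserved; only the span ID changes.
    """
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


edge = new_traceparent()
downstream = child_traceparent(edge)
print(edge)
print(downstream)
```

If any service in the path drops the header instead of forwarding it, the spans after that point start a new trace and the request can no longer be followed end to end.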

Jaeger for distributed tracing

Jaeger is an open-source tracing tool used to monitor and trace transactions among distributed services. Jaeger supports several languages using its client instrumentation libraries that are based on OpenTracing APIs, including Go, Java, Node.js, Python, C++, and C#.

While the tool offers in-memory storage for testing setups, SREs can connect the trace data pipeline to either Elasticsearch or Cassandra for backend storage. The Jaeger console is a feature-rich user interface that allows SREs to visualize distributed traces and develop dependency graphs for end-to-end tracing of microservices and other granular components of distributed systems. Popular Jaeger use cases include distributed transaction monitoring, root cause analysis, distributed context propagation, and performance optimization of containers.

Key features of Jaeger include the following:

A feature-rich UI
Ease of installation and use
Flexibility in configuring the storage backend

Using logs, metrics, and traces together in o11y

Distributed systems are inherently complex and often rely on a number of different moving parts working together. While logs, metrics, and traces each serve a unique purpose for o11y, their goals often overlap, and they provide the most comprehensive visibility when used together.

In most cases, issue identification starts by observing metrics, which help detect the occurrence of an event that is impacting system performance or compromising security. Once the event is detected, logs help by providing detailed information about the cause of the event by reading errors, warnings, or exceptions generated by endpoints. As system events span multiple services, tracing is used to identify the components and paths responsible for or affected by the event. The practice allows for comprehensive visibility, ultimately leading to improved system stability and availability.
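
The workflow described above, from metrics to logs to traces, can be sketched over in-memory data. Everything in this example is hypothetical: the telemetry values, the field names, the 5% threshold, and the `investigate` function are assumptions used to show the order of the steps, not a real query API.

```python
from collections import Counter

# Hypothetical telemetry for one system over four minutes.
error_rate_by_minute = {0: 0.01, 1: 0.01, 2: 0.09, 3: 0.08}
logs = [
    {"minute": 1, "level": "INFO", "trace_id": "t1", "message": "order placed"},
    {"minute": 2, "level": "ERROR", "trace_id": "t2", "message": "payment timeout"},
    {"minute": 2, "level": "ERROR", "trace_id": "t3", "message": "payment timeout"},
    {"minute": 3, "level": "INFO", "trace_id": "t4", "message": "order placed"},
]
# Each trace lists (service, succeeded) hops in request order.
traces = {
    "t2": [("gateway", True), ("orders", True), ("payments", False)],
    "t3": [("gateway", True), ("payments", False)],
}


def investigate(threshold=0.05):
    # Step 1: metrics show when the error rate breached the threshold.
    breached = {m for m, rate in error_rate_by_minute.items() if rate > threshold}
    # Step 2: logs from those minutes describe what failed.
    errors = [e for e in logs if e["minute"] in breached and e["level"] == "ERROR"]
    # Step 3: traces show which service failed on each affected request.
    suspects = Counter(
        service
        for e in errors
        for service, ok in traces.get(e["trace_id"], [])
        if not ok
    )
    return sorted(breached), errors, suspects


minutes, errors, suspects = investigate()
print(minutes)                  # [2, 3]
print(suspects.most_common(1))  # [('payments', 2)]
```

The trace ID carried in each log entry is what links the second step to the third, which is one reason to include correlation identifiers in log records.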

Best practices for comprehensive observability

Although organizations may choose to adopt different practices to suit their use cases, an o11y framework typically relies on the following recommended practices for efficiency.

Commit to one tool for simpler management

Centralizing repositories of metrics and log data from various sources helps simplify the observability of distributed systems. With an aggregated repository, cross-functional teams can collaborate to recognize patterns, detect anomalies, and produce contextual analysis for faster remediation.

Optimize log data

Log data should be optimally organized for storage efficiency and faster data extraction. Besides reducing the time and effort required for log analysis, optimized logs also help developers and SREs prioritize metrics that need to be tracked. To ensure that logs are structured and easily accessible, entries should be formatted for easier correlation and include key parameters that help detect component-level anomalies, such as:

User ID
Session ID
Timestamps
Resource usage
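
A minimal sketch of such structured entries, using Python's standard `logging` module to emit one JSON object per event, is shown below. The field names `user_id`, `session_id`, and `cpu_percent`, the logger name, and the sample values are assumptions chosen to mirror the parameters listed above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object for easier correlation."""

    # Hypothetical context fields copied from the record when present.
    FIELDS = ("user_id", "session_id", "cpu_percent")

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Values passed through `extra` become attributes on the log record.
logger.info(
    "payment declined",
    extra={"user_id": "u-123", "session_id": "s-456", "cpu_percent": 71.5},
)
```

Because every entry shares the same keys, a log store can index them and filter by user or session without parsing free-form text.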

Prioritize data correlation for context analysis

Correlating metrics, logs, and traces from an aggregated repository gives each signal the context of the others, so cross-functional teams can recognize patterns, detect anomalies quickly, and remediate faster. Correlated data also supports heuristics, historical context analysis, and pattern recognition that enable early detection of issues that could degrade performance.

Use deployment markers for distributed tracing

Deployment markers are time-based visual indicators of events in a deployed application. Use them to associate code changes with their effects on performance metrics, which makes regressions and optimization opportunities easier to spot. Deployment markers also help implement aggregated logs and component catalogs that act as centralized resources for distributed teams to monitor cluster health and trace root causes.

Implement dynamic sampling

Because distributed systems produce large amounts of observable data, the collection and retention of all that data often lead to system degradation, tedious analysis, and cost inflation. To avoid this, it is recommended to curate the observability data being collected on a set frequency. Curating time-based, sampled data of traces, logs, and metrics helps optimize resource usage while augmenting observability through pattern discovery.
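
One way to sketch this kind of sampling for traces is to always keep the traces most likely to matter (errors and slow requests) and keep only a fixed fraction of the rest. The example below is an assumption-laden illustration: the function name, the 10% rate, and the one-second latency cutoff are hypothetical, and hashing the trace ID is just one way to make the decision repeatable across services.

```python
import hashlib


def keep_trace(trace_id, is_error, duration_ms, sample_rate=0.1, slow_ms=1000):
    """Decide whether to retain a trace.

    Errors and slow requests are always kept; other traces are kept
    with probability `sample_rate`, decided by hashing the trace ID so
    that every service reaches the same decision for the same trace.
    """
    if is_error or duration_ms >= slow_ms:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical traffic: 10,000 fast, successful requests.
kept = sum(keep_trace(f"trace-{i}", False, 20) for i in range(10_000))
print(f"kept {kept} of 10000 routine traces")
print(keep_trace("trace-x", True, 20))     # True
print(keep_trace("trace-y", False, 2500))  # True
```

Raising or lowering `sample_rate` in response to traffic volume is what would make the policy dynamic; the sketch keeps the rate fixed for brevity.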

Set up alerts only for critical events

To avoid alert overload and ensure that only meaningful alerts are addressed by site reliability engineers, it is recommended to calibrate alert thresholds that accurately flag system-state events. A severity-based alerting mechanism helps distinguish critical, error, and warning alerts based on conditional factors, subsequently allowing developers and security professionals to prioritize the remediation of flaws.
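
A severity-based mechanism of this kind can be sketched as a list of thresholds checked from most to least severe. The threshold values, the use of error rate as the signal, and the routing policy are all hypothetical and would need calibration against a real system.

```python
# Hypothetical error-rate thresholds, ordered from most to least severe.
SEVERITY_THRESHOLDS = [("critical", 0.05), ("error", 0.02), ("warning", 0.01)]


def classify(error_rate):
    """Return the highest severity whose threshold is met, or None."""
    for severity, threshold in SEVERITY_THRESHOLDS:
        if error_rate >= threshold:
            return severity
    return None


def route(severity):
    """Hypothetical routing policy: only critical alerts page a person."""
    actions = {
        "critical": "page on-call",
        "error": "open ticket",
        "warning": "post to dashboard",
    }
    return actions.get(severity, "no alert")


for rate in (0.001, 0.012, 0.03, 0.07):
    print(rate, classify(rate), route(classify(rate)))
```

Keeping the paging action limited to the top severity is what prevents lower-priority conditions from contributing to alert overload.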

Integrate alerts and metrics with automated incident response platforms

Integrating alerts with automated incident response systems is a key DevOps practice that can help improve the efficiency of your organization’s response to critical incidents. By automating the process of routing and responding to alerts, you can help ensure that critical incidents are handled in a timely and efficient manner while keeping all stakeholders updated.

Automated incident response systems, such as Squadcast, can help streamline your organization’s overall response to critical incidents. The platform adds context to alerts and metrics through payload analysis and event tagging. The approach not only helps eliminate alert overload but also expedites remediation and shortens MTTR by leveraging executable runbooks. Squadcast also helps track all configured SLOs to identify service-level breaches and capture post-mortem notes to help avoid recurring problems in the future.

Closing thoughts

Observability extends continuous monitoring by enriching performance data, visualizing it, and providing actionable insights to help solve reliability issues. Beyond monitoring application health, o11y mechanisms measure how users interact with the software, tracking changes made to the system and the impacts of such changes, thereby enabling developers to fine-tune the system for ease of use, availability, and security.

Observing modern applications requires a novel approach that relies on deeper analysis and aggregation of a constant stream of data from distributed services. While it is important to inspect the key indicators, it is equally important to adopt the right practices and efficient tools that support o11y to collectively identify what is happening within a system, including its state, behavior, and interactions with other components.