Golden Signals: Tutorial & Best Practices

For simple architectures, capturing system logs and building dashboards is easy. However, as the number of components grows and the system becomes distributed, basic logging and monitoring quickly grow into full site reliability engineering (SRE). SRE is a discipline that originated at Google and focuses on operating software systems, including SaaS, reliably and efficiently at scale.

Golden signals are the four key metrics used to monitor systems and proactively take action to improve performance and maintain reliability. SRE depends on these golden signals:

Latency: The time required to respond to a request, usually summarized as an average or a percentile
Traffic: The total number of requests a system serves over a given period, usually measured in requests per minute (RPM)
Errors: The number of errors experienced from the end user’s perspective, typically measured over a time interval (e.g., the number of HTTP 5xx responses per minute)
Saturation: How much of each resource (such as CPU, memory, storage, and network) is consumed at a given time, relative to its capacity

In this article, we provide a deeper understanding of these golden signals and their significance. We also explore important considerations for organizations to monitor how well the golden signals of their complex systems adhere to their service-level objectives (SLOs) and service-level agreements (SLAs). Finally, we explore some of the valuable tools in the market that support SRE principles.

Summary of key golden signal concepts

The table below summarizes some essential concepts we will cover in this article.

Concept	Description
Monitoring	Tracking and triggering alerts based on the log and metric information generated by the system
Alerting	Notifying the correct stakeholder or team about a particular threshold event
Logging	Generating and storing critical information about the transactions processed and executed by various system components
Metrics	Measuring specific characteristics of a system to quantify, track, and assess aspects such as performance, process, and quality
Latency	The average total time required to process a request
Traffic	The number of incoming requests at any given point in time
Errors	HTTP responses with error codes attributed to an inability to access services, incorrect information, or failure to adhere to SLOs and SLAs
Saturation	A measure of load on the system’s network, compute, and storage resources, usually measured against the hardware capacity of the given component
Service	A combination of processes, tools, microservices, and infrastructure components responsible for delivering value to users

Why do we need golden signals?

Over time, the architectural patterns of software systems have evolved from monolithic to microservices-based, from on-premises to the cloud, and from single locations to distributed systems. These changes have improved the reliability and scaling of these systems, but they have also introduced challenges to traditional logging and monitoring practices.

As systems and architectural patterns evolve, monitoring and pinpointing their exact bottlenecks becomes more challenging. The following are some of the factors introducing this complexity:

The transient nature of containers
Autoscaling and variations in resource allocation for containers, which introduces complexity in gathering host system statistics
Monitoring additional cloud components, such as networks, databases, storage, specialized compute services, and load balancers
Interdependencies among multiple services

In the past, merely collecting data about the underlying systems was the primary challenge of monitoring and logging. As those practices have evolved into observability, hundreds of different data points about the minutiae of system behavior can now easily be collected and analyzed. The challenge today is to identify the key indicators of the overall state of the system in this flood of data. Golden signals are those key indicators, and they apply to almost all kinds of systems.

What are golden signals?

Let’s explore the four golden signals mentioned previously in more depth. The image below represents the architecture of a simple SaaS platform. It contains the typical infrastructure components used to provide a service to end users via the internet.

The colored dots in the diagram represent possible bottlenecks attributed to respective golden signals based on the table below. Please note that the golden signals may apply to any component in the system. The indicators in the diagram are for the references made in this article.

Color	Related golden signal
Red	Latency
Blue	Traffic
Yellow	Errors
Amber	Saturation

Latency

Latency is the time required to respond to a particular HTTP request with the current infrastructure design, typically reported as an average. Several factors affect it directly or indirectly. Direct factors lie inside the system and affect latency even when the system is fully functional and no external factors are involved. Users also experience latency because of external factors such as network congestion, limited bandwidth, and slow internet connections. These indirect factors are beyond the system administrator’s control.

Latency always exists as the request travels via the internet to the service endpoint and back after processing. It can never be zero. Efforts are made to minimize the latency, but they cannot remove it completely.

In the diagram above, the red dots indicate the probable causes (direct and indirect) of latency that are under our control. The following are some of the direct causes:

Inefficient frontend services that cause additional time to be needed to respond to requests
Authentication logic that takes too long to generate and respond with a session token
The host infrastructure running on weaker hardware in terms of CPU, network, and memory capacity
A database engine not powerful enough to execute queries at the required rate
Poor design of frontend applications, causing more time to load
Poor design of backend services, which take too much time to process requests
API gateways, load balancers, bastion hosts, and other components having issues dealing with many requests at once
Tighter network security controls

Some of the indirect causes may include the following:

The user’s location being far from the service’s host location, which generally means higher latency
Lower capacity or older user systems resulting in slow data rendering

In SaaS, latency is somewhat proportional to the number of hops between the user and the service endpoint, and this is true for internal network components as well. Irrespective of how efficient the networking devices are, to fulfill their purpose, they also need to make routing decisions which, although minimal, add to the system’s latency.

However, minimizing the number of hops may not be the only solution to improving latency. Sometimes the key to reducing latency lies in addressing other bottlenecks: the indirect factors affecting latency. Actions here include improving client application performance, using content delivery networks, using global distribution services, and creating multiple region-specific deployments.
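
To make the latency signal concrete, the sketch below summarizes a handful of response times with a mean and two nearest-rank percentiles. The sample values and the `percentile` helper are illustrative assumptions, not part of any particular monitoring tool. A single slow request pulls the mean well above the median, which is one reason percentiles are often tracked alongside averages.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; one request is an outlier.
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 115, 99, 107]

summary = {
    "mean_ms": mean(latencies_ms),
    "p50_ms": percentile(latencies_ms, 50),
    "p99_ms": percentile(latencies_ms, 99),
}
print(summary)
```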

Traffic

Traffic is a measure of the number of requests being served at various times by multiple hardware components in the system. A SaaS application can experience uneven traffic flow due to various factors, such as the following:

Region-specific features of the SaaS product being used at different times of day, depending on the time zones of the users
Organizational marketing activities that roll out business offers and discounts at specific times
Regional and global events that trigger users’ need to use the service
Random traffic surges due to unknown factors

Managing surges in traffic is about having the right autoscaling mechanisms. Inefficient scaling has cost implications: Static scaling of the system to maximum capacity results in unnecessary costs during non-peak hours.

Cloud providers offer various scaling features for their services. Understanding the traffic flow, both for predictable traffic and for unpredictable surges (anomalies), helps design a scaling strategy that avoids bottlenecks at multiple points in the system.

The blue dots in the diagram represent areas where traffic spikes are likely to impact service delivery. Typically, if the system cannot handle a surge, the result is timeout errors, slowdowns, data loss, and interruptions in the user experience.
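
As a rough illustration of measuring traffic, the following sketch groups request timestamps into one-minute buckets and reports the busiest minute. The timestamps (seconds since an arbitrary start) and the function name are assumptions made for the example. In practice these counts would come from access logs or a metrics pipeline.

```python
from collections import Counter


def requests_per_minute(timestamps):
    """Count requests per one-minute bucket, keyed by the bucket's start second."""
    return Counter(int(ts) // 60 * 60 for ts in timestamps)


# Hypothetical request arrival times, in seconds.
request_times = [0, 12, 59, 60, 61, 75, 119, 120]

rpm = requests_per_minute(request_times)
peak_bucket, peak_count = max(rpm.items(), key=lambda item: item[1])
print(f"peak minute starts at {peak_bucket}s with {peak_count} requests")
```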

Errors

Any service interruption caused by bugs in the application code, misconfiguration of network components, or not adhering to the service-level agreements of delivery is known as an error. These errors can have the following effects:

Users being unable to access the service
Users being served the wrong content
Users being unable to access resources to which they are entitled due to incorrect access control configurations
Broken functionality
Excessive system response time

Symptoms like these suggest a probable cause of failure, but pinpointing the exact issue requires investigation. Errors may originate from many different places in a system. The yellow dots in the diagram above highlight some probable sources of errors:

The frontend application may have bugs or unaddressed use cases (e.g., browser compatibility problems)
Incorrect external and internal network access due to misconfigurations in virtual private clouds (VPCs), virtual private networks (VPNs), security groups, firewalls, and other network components
Faulty application code
Faulty authentication and authorization strategy
Misconfiguration of the host infrastructure

When errors are intermittent, it becomes more challenging to track down the root cause. Tracking the errors, error rates, symptoms, and causes helps with identifying the root cause, and addressing these errors makes the system more stable.
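
Tracking an error rate can be sketched as the share of responses that count as failures. Which responses count is a policy decision. The example below assumes, purely for illustration, that only HTTP 5xx status codes are failures, and it makes that rule a replaceable parameter.

```python
def is_server_error(status_code):
    """Assumed policy for this example: only 5xx responses count as errors."""
    return 500 <= status_code <= 599


def error_rate(status_codes, is_error=is_server_error):
    """Fraction of responses classified as errors (0.0 when there is no traffic)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if is_error(code))
    return failures / len(status_codes)


# Hypothetical status codes observed during one interval.
observed = [200, 200, 500, 200, 503, 404, 200, 200, 200, 200]
print(f"error rate: {error_rate(observed):.0%}")
```

Passing a different `is_error` function, such as one that also counts 4xx responses, changes the policy without touching the calculation.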

Saturation

Saturation describes how fully system resources are utilized. A fully saturated resource cannot process any additional requests or handle additional workloads.

It is important to track the resource utilization of various components, which usually have hard capacity limits, like CPU, memory, and storage limits. Knowing the level of resources consumed to serve a request or a certain number of requests provides us with a key metric: saturation.

Some examples of issues caused by saturation, as highlighted in the image with the amber dots, are the following:

The frontend service not being able to serve the demand due to insufficient resource allocation
Autoscaling failure due to insufficient IP address availability in the CIDR range
Exhausted capacity of backend servers where microservice clusters are hosted, causing subsequent requests to fail
Saturated read/write capacity of database solutions introducing latency
Capacity allocated to the storage solution being fully utilized, causing newly created files to be lost

Knowing the level of saturation of the system resources provides input for scaling capacities up/down or in/out based on utilization to maintain service continuity.
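
A minimal sketch of this idea is to express each resource as a ratio of usage to capacity and flag anything above a chosen limit. The resource names, the figures, and the 85% limit below are hypothetical values chosen for the example.

```python
def saturation(used, capacity):
    """Utilization as a fraction of capacity (1.0 means fully saturated)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


# Hypothetical (used, capacity) pairs for one host.
resources = {
    "cpu_cores": (7.2, 8),
    "memory_gib": (12, 32),
    "disk_gib": (450, 500),
}
SCALE_OUT_LIMIT = 0.85  # assumed limit; tune per component

nearly_saturated = sorted(
    name
    for name, (used, capacity) in resources.items()
    if saturation(used, capacity) >= SCALE_OUT_LIMIT
)
print(nearly_saturated)
```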

Best practices

Use golden signals to liaise between alerting and troubleshooting

Alerting and troubleshooting are distinct but related concepts. Alerting refers to the process of triggering automated notifications to stakeholders based on the threshold value of a metric. Troubleshooting is the process of identifying and resolving issues affecting the system’s functionality or performance.

Golden signals are key performance indicators (KPIs) commonly used to monitor and measure the system’s performance. Alerts configured based on these golden signals help identify issues and allow root cause analysis. Alert configurations are conditional and situational, identifying the risk in a given scenario.

However, these alerts do not always indicate the root cause of the problem. They should be treated as indicators of causes that may lie elsewhere in the system. Sometimes they merely provide a direction, which is why troubleshooting is required.

Golden signals are used as a starting point to help narrow down the scope of an issue. The troubleshooting efforts focus on the components and systems related to a particular signal.

Careful troubleshooting involves tracing and sequencing the logs generated by various sources. For example, it is tempting to blame the application source code when incoming requests begin to fail. Troubleshooting may instead reveal that the root cause lies in insufficient storage capacity or internal network access issues.

Use golden signals to define alerts that provide critical feedback quickly. Once the alerts are received, use the insights provided by golden signals to troubleshoot the underlying issue.
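
Threshold-based alerting on golden signals can be sketched as a comparison of current metric values against configured limits. The metric names and limits below are assumptions for illustration. Real limits would normally be derived from the service’s SLOs, and real alerts would be routed to a notification system rather than printed.

```python
# Hypothetical thresholds; real values should come from the service's SLOs.
THRESHOLDS = {
    "latency_p99_ms": 500,
    "error_rate": 0.01,
    "cpu_saturation": 0.85,
}


def evaluate_alerts(metrics, thresholds=THRESHOLDS):
    """Return one alert per metric whose current value exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append({"signal": name, "value": value, "threshold": limit})
    return alerts


current = {"latency_p99_ms": 730, "error_rate": 0.004, "cpu_saturation": 0.91}
alerts = evaluate_alerts(current)
for alert in alerts:
    print(f"ALERT {alert['signal']}: {alert['value']} > {alert['threshold']}")
```

Each alert names the signal that fired, which gives troubleshooting a starting point without claiming to identify the root cause.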

Isolate logs and alerts based on business services

A microservice-based architecture implements multiple independent software components designed to perform individual tasks. Additionally, there are copies/replicas of these containerized components, which are short-lived in nature.

Capturing the application logs generated from microservices is easy. However, isolating the logs to map them back to the delivered service is tricky. This is especially the case when multiple microservices, which support multiple business services, may simultaneously generate logs. It is important to embed identifiers and filter the logs for identifying services undergoing troubleshooting efforts.

Sometimes these filters also depend on how the application source code generates the log messages. Application logs, system logs, database access logs, network logs, and logs from other monitored components are often stored in the same location. Carefully configuring filter keys in the application’s source code and in the various system components helps with isolating and tracing underlying issues.

One tip is to use tags wherever possible.
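
One way to apply tags is to emit each log record as structured data with a service identifier, so records from many microservices can be filtered even when they share one store. The sketch below uses JSON lines. The field names (`service`, `level`, `request_id`) and the helper functions are hypothetical choices for this example.

```python
import json


def make_log(service, level, message, **tags):
    """Serialize one log record as JSON with a mandatory service tag."""
    record = {"service": service, "level": level, "message": message}
    record.update(tags)
    return json.dumps(record)


def filter_logs(lines, **wanted):
    """Keep records whose fields match every key/value pair in `wanted`."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if all(r.get(k) == v for k, v in wanted.items())]


# Hypothetical log lines from two business services sharing one store.
log_lines = [
    make_log("checkout", "ERROR", "payment declined", request_id="r-1"),
    make_log("search", "INFO", "query served", request_id="r-2"),
    make_log("checkout", "INFO", "cart updated", request_id="r-3"),
]

checkout_errors = filter_logs(log_lines, service="checkout", level="ERROR")
print(checkout_errors)
```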

Implement precise and responsible logging

As the number of components grows, log message generation increases at the same rate. This generates a lot of data and creates a couple of issues: It increases storage costs and makes it difficult to query and filter the logs to trace a buggy event.

It’s important to generate meaningful logs to utilize logging and monitoring systems efficiently. Redundant logs, or ones that provide little value, should be removed or merged into existing log statements. Consolidating as many relevant attributes as possible into a single log statement provides the needed information with less tracing effort.

Use cold-warm-hot storage to manage costs

Keeping all logs in a fast, queryable database can become expensive, especially when the rate of log generation increases over time. Moving logs through hot, warm, and cold storage tiers as they age helps reduce storage costs. For example, consider storing the logs generated in the last hour in a database where queries execute quickly (hot). Then retire them to a cheaper database or file storage solution (warm) for the next 24 hours, where the logs are still available on demand but with slower query performance. Finally, retire them to a cold storage solution intended for logs that rarely need to be accessed.
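
The tiering rule in this example can be sketched as a function of a log’s age. The one-hour and 24-hour boundaries come from the example above, and the function and constant names are illustrative. Actual retention periods depend on query patterns and compliance requirements.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(hours=1)    # fast, queryable database
WARM_WINDOW = timedelta(hours=24)  # cheaper storage for the following 24 hours


def storage_tier(age):
    """Map a log record's age to the tier it should live in."""
    if age < HOT_WINDOW:
        return "hot"
    if age < HOT_WINDOW + WARM_WINDOW:
        return "warm"
    return "cold"


for age in (timedelta(minutes=30), timedelta(hours=5), timedelta(days=3)):
    print(age, "->", storage_tier(age))
```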

Moderate the request-to-log ratio

To predict the resources needed for logging and monitoring solutions, it is recommended to adhere to a specific request-to-log ratio: the number of log entries a single request generates when a success or failure use case is executed.

Moderating this ratio helps strategize and quantify the logging and monitoring solutions.
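
As a back-of-the-envelope illustration, the ratio can be combined with request volume and an average log size to estimate daily log storage. All figures below are hypothetical. The function simply multiplies the inputs, so it is only as accurate as the assumed averages.

```python
def daily_log_volume_gib(requests_per_day, logs_per_request, avg_log_bytes):
    """Estimated log volume per day, in GiB."""
    total_bytes = requests_per_day * logs_per_request * avg_log_bytes
    return total_bytes / 1024**3


# Hypothetical workload: 10 million requests/day, 4 log entries each, 512 bytes per entry.
estimate = daily_log_volume_gib(10_000_000, 4, 512)
print(f"{estimate:.2f} GiB per day")

# Halving the ratio halves the estimate, which shows why moderating it matters.
reduced = daily_log_volume_gib(10_000_000, 2, 512)
print(f"{reduced:.2f} GiB per day")
```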

Tools

Various self-managed and managed solutions are available in the market to monitor, track, and visualize metrics like golden signals. Typically, the following are the most desirable features of monitoring tools that support SRE principles:

A real-time health dashboard to represent the status of the four golden signals and additional parameters, such as Grafana’s RED (rate, errors, duration) dashboards
The ability to capture and query application and system logs for root cause analysis (RCA)
Various levels of logging (ERR, INFO, WARN, and DEBUG)
The ability to generate reports and trends based on historical data
Intuitive visualization tools to pinpoint probable causes of failure
Integration with on-premises and cloud infrastructure components
Integration with third-party SaaS solutions
The ability to set customized alerts
Support for the organization’s incident management process

Self-hosted (open source) solutions offer budget-friendly tooling options. However, these solutions incur infrastructure and maintenance costs because the organization must host and operate them itself. Some options for self-hosted monitoring solutions include Prometheus, Zabbix, Nagios, Cacti, and Sensu.

If budget is not a constraint and convenience is the priority, organizations can opt for managed monitoring solutions like SolarWinds, Datadog, AppDynamics, Honeycomb, or Pingdom.

Note that some of the open source tools described above also offer managed services.

Conclusion

In this article, we explained what golden signals are and their role in the SRE domain. As systems and architectural patterns evolve, the need for better monitoring solutions grows. Organizations that deliver services with complex internal architectures need to keep an eye on these golden signals for faster response and resolution. This is where monitoring and visualization tools come into the picture, and we presented various self-managed and managed options.

It is also important to configure the desired monitoring solutions to align them with business services, depending on the internal setup and component architecture. When the sources, metrics, and parameters are aligned, this adds more meaning to the golden signals and other key metrics.
