For simpler architectures, capturing system logs and creating dashboards was easy. However, as the number of components grows and the system becomes distributed, simple logging and monitoring quickly expands into full site reliability engineering (SRE). SRE is a discipline pioneered at Google that applies software engineering practices to operations, enabling efficient scaling and improved operational performance of software systems, including SaaS.
Golden signals are the four key metrics used to monitor systems and proactively take action to improve performance and maintain reliability. SRE depends on these golden signals:
- Latency: The mean time required to respond to a request
- Traffic: The total number of requests a system serves over a given period, usually measured in requests per minute (RPM)
- Errors: The number of errors experienced from the end user’s perspective, typically measured over a time interval (e.g., the number of HTTP 404 errors per minute)
- Saturation: The total/peak consumption of various resources at a given time
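As a rough illustration, the four signals can be computed from a window of request records. The field names and sample values below are hypothetical:

```python
from statistics import mean

# Illustrative request records captured over a one-minute window
requests = [
    {"duration_ms": 120, "status": 200},
    {"duration_ms": 340, "status": 200},
    {"duration_ms": 95,  "status": 404},
    {"duration_ms": 210, "status": 200},
]
cpu_samples = [0.55, 0.72, 0.91]  # fraction of CPU capacity in use

window_minutes = 1

latency_ms = mean(r["duration_ms"] for r in requests)       # Latency: mean response time
traffic_rpm = len(requests) / window_minutes                # Traffic: requests per minute
errors_per_min = sum(r["status"] >= 400 for r in requests) / window_minutes  # Errors
saturation = max(cpu_samples)                               # Saturation: peak resource use

print(latency_ms, traffic_rpm, errors_per_min, saturation)
```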
In this article, we provide a deeper understanding of these golden signals and their significance. We also explore important considerations for organizations to monitor how well the golden signals of their complex systems adhere to their service-level objectives (SLOs) and service-level agreements (SLAs). Finally, we explore some of the valuable tools in the market that support SRE principles.
Summary of key golden signal concepts
The table below summarizes some essential concepts we will cover in this article.

| Concept | Description |
| --- | --- |
| Latency | The mean time required to respond to a request |
| Traffic | The total number of requests served over a given period, usually measured in requests per minute (RPM) |
| Errors | The number of errors experienced from the end user's perspective, typically measured over a time interval |
| Saturation | The total/peak consumption of various resources at a given time |
| Alerting vs. troubleshooting | Golden signals liaise between automated alerts and the troubleshooting that follows |
| Log isolation | Tagging and filtering logs so they can be mapped back to business services |
| Responsible logging | Generating precise, consolidated logs, tiering log storage, and moderating the request-to-log ratio |
Why do we need golden signals?
Over time, the architectural patterns of software systems have evolved from monolithic to microservices-based, from on-premises to the cloud, and from single locations to distributed systems. These changes have improved the reliability and scaling of these systems, but they have also introduced challenges to traditional logging and monitoring practices.
As systems and architectural patterns evolve, monitoring and pinpointing their exact bottlenecks becomes more challenging. The following are some of the factors introducing this complexity:
- The transient nature of containers
- Autoscaling and variations in container resource allocation, which complicate gathering host system statistics
- Monitoring additional cloud components, such as networks, databases, storage, specialized compute services, and load balancers
- Interdependencies among multiple services
In the past, merely collecting data about the underlying systems was the primary challenge of monitoring and logging. As those practices have evolved into observability, hundreds of different data points about the minutiae of system behavior can now easily be collected and analyzed. The challenge today is to identify the key indicators of the overall state of the system within this flood of data. Golden signals are those key indicators, and they apply to almost all kinds of systems.
What are golden signals?
Let’s explore the four golden signals mentioned previously in more depth. The image below represents the architecture of a simple SaaS platform. It contains the typical infrastructure components used to provide a service to end users via the internet.
The colored dots in the diagram represent possible bottlenecks attributed to the respective golden signals: red for latency, blue for traffic, yellow for errors, and amber for saturation. Please note that the golden signals may apply to any component in the system; the indicators in the diagram correspond to the references made throughout this article.
Latency is the average time required to respond to a particular HTTP request with the current infrastructure design. Several factors directly or indirectly impact a system's latency. Direct factors affect latency when the system is fully functional and no external factors are involved. Users also experience latency because of external factors like traffic congestion, network bandwidth, and internet speed; these indirect factors are beyond the system administrator's control.
Latency always exists because the request travels via the internet to the service endpoint and back after processing; it can never be zero. Efforts can minimize latency, but they can never remove it completely.
In the diagram above, the red dots indicate the probable causes (direct and indirect) of latency that are under our control. The following are some of the direct causes:
- Inefficient frontend services that cause additional time to be needed to respond to requests
- Authentication logic that takes too long to generate and respond with a session token
- The host infrastructure running on weaker hardware in terms of CPU, network, and memory capacity
- A database engine not powerful enough to execute queries at the required rate
- Poor design of frontend applications, causing more time to load
- Poor design of backend services, which take too much time to process requests
- API gateways, load balancers, bastion hosts, and other components having issues dealing with many requests at once
- Tighter network security controls
Some of the indirect causes may include the following:
- The user’s location being far from the service’s host location, which generally means higher latency
- Lower capacity or older user systems resulting in slow data rendering
In SaaS, latency is roughly proportional to the number of hops between the user and the service endpoint, and this holds for internal network components as well. However efficient the networking devices are, they still need to make routing decisions, and that processing, although minimal, adds to the system's latency.
However, minimizing the number of hops may not be the only solution to improving latency. Sometimes the key to reducing latency lies in addressing other bottlenecks: the indirect factors affecting latency. Actions here include improving client application performance, using content delivery networks, using global distribution services, and creating multiple region-specific deployments.
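Since a mean can hide slow outliers, latency is often tracked alongside percentiles such as p95 and p99. A minimal sketch using a hypothetical sample of response times:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical sample: 95 fast requests and 5 very slow ones
latencies_ms = [100] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

# The mean (195 ms) looks healthy, but p99 (2000 ms) exposes the slow tail
print(mean_ms, p95_ms, p99_ms)
```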
Traffic is a measure of the number of requests being served at various times by multiple hardware components in the system. A SaaS application can experience uneven traffic flow due to various factors, such as the following:
- Region-specific features of the SaaS product being consumed in timezones specific to users based on their locations
- Organizational marketing activities that roll out business offers and discounts at specific times
- Regional and global events that trigger users’ need to use the service
- Random traffic surges due to unknown factors
Managing surges in traffic is about having the right autoscaling mechanisms. Inefficient scaling has cost implications: Static scaling of the system to maximum capacity results in unnecessary costs during non-peak hours.
Cloud providers offer various scaling features for their services. Understanding the traffic flow for both predictable traffic and unpredictable (anomalies) surges helps design a scaling strategy to avoid bottlenecks at multiple points in the system.
The blue dots in the diagram represent areas where traffic spikes are likely to impact service delivery. Typically, if the system cannot handle a surge, the result is timeout errors, slowdowns, data loss, and interruptions in the user experience.
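A scaling strategy can be sketched as a function from observed traffic to a desired replica count; the per-replica capacity and bounds below are illustrative assumptions:

```python
def desired_replicas(current_rpm, rpm_per_replica, min_replicas=2, max_replicas=20):
    """Scale out when traffic grows and in when it drops, within fixed bounds."""
    needed = -(-current_rpm // rpm_per_replica)  # ceiling division
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(950, 100))    # scale out during a surge
print(desired_replicas(50, 100))     # keep the minimum during off-peak hours
print(desired_replicas(10000, 100))  # cap at the maximum to bound costs
```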
An error is any service interruption caused by bugs in the application code, misconfiguration of network components, or a failure to adhere to the service-level agreements for delivery. These errors can have the following effects:
- Users being unable to access the service
- Users being served the wrong content
- Users being unable to access resources to which they are entitled due to incorrect access control configurations
- Broken functionality
- Excessive system response time
With symptoms like those listed above, we can predict the probable cause of the failure but not pinpoint the exact issue, which requires some investigation. Errors may originate from many different places in a system. The yellow dots in the diagram above highlight some probable sources of errors:
- The frontend application may have bugs or unaddressed use cases (e.g., browser compatibility problems)
- Incorrect external and internal network access due to misconfigurations in virtual private clouds (VPCs), virtual private networks (VPNs), security groups, firewalls, and other network components
- Faulty application code
- Faulty authentication and authorization strategy
- Misconfiguration of the host infrastructure
When errors are intermittent, it becomes more challenging to track down the root cause. Tracking the errors, error rates, symptoms, and causes helps with identifying the root cause, and addressing these errors makes the system more stable.
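Counting errors from the end user's perspective can be as simple as classifying HTTP status codes over a time window. A sketch with made-up sample data:

```python
from collections import Counter

def error_rate(statuses):
    """Fraction of responses that are errors (HTTP status >= 400)."""
    if not statuses:
        return 0.0
    return sum(s >= 400 for s in statuses) / len(statuses)

# Hypothetical statuses observed in one minute
minute_of_statuses = [200, 200, 404, 500, 200, 200, 200, 503, 200, 200]

print(error_rate(minute_of_statuses))
# Breaking the count down by status code hints at where to look first
print(Counter(s for s in minute_of_statuses if s >= 400))
```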
Saturation refers to a scenario where system resources are fully utilized and cannot process any additional requests or handle additional workloads.
It is important to track the resource utilization of various components that have hard capacity limits, such as CPU, memory, and storage. Knowing the level of resources consumed to serve a request, or a certain number of requests, provides us with a key metric: saturation.
Some examples of issues caused due to saturation, as highlighted in the image with the amber circles, are the following:
- The frontend service not being able to serve the demand due to insufficient resource allocation
- Autoscaling failure due to insufficient IP address availability in the CIDR range
- Exhausted capacity of backend servers where microservice clusters are hosted, causing subsequent requests to fail
- Saturated read/write capacity of database solutions introducing latency
- Memory allocated to the storage solution being fully utilized, causing newly created files to be lost
Knowing the level of saturation of the system resources provides input for scaling capacities up/down or in/out based on utilization to maintain service continuity.
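A saturation check can compare current usage against hard capacity limits per resource. The resource names, limits, and the 90% threshold below are illustrative:

```python
# Hypothetical capacity limits and current usage for one host
limits = {"cpu_cores": 8, "memory_gb": 32, "disk_gb": 500}
usage  = {"cpu_cores": 7.6, "memory_gb": 20, "disk_gb": 480}

def saturation_report(usage, limits, threshold=0.9):
    """Return the resources whose utilization meets or exceeds the threshold."""
    return {
        name: round(usage[name] / limits[name], 2)
        for name in limits
        if usage[name] / limits[name] >= threshold
    }

# CPU and disk are close to saturation; memory has headroom
print(saturation_report(usage, limits))
```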
Use golden signals to liaise between alerting and troubleshooting
Alerting and troubleshooting are distinct but related concepts. Alerting refers to the process of triggering automated notifications to stakeholders based on the threshold value of a metric. Troubleshooting is the process of identifying and resolving issues affecting the system’s functionality or performance.
Golden signals are key performance indicators (KPIs) commonly used to monitor and measure the system’s performance. Alerts configured based on these golden signals help identify issues and allow root cause analysis. Alert configurations are conditional and situational, identifying the risk in a given scenario.
However, these alerts do not always indicate the root cause of a problem. They should be treated as indicators that may point to symptoms far removed from the underlying cause; sometimes they merely provide a direction, which is why troubleshooting is required.
Golden signals are used as a starting point to help narrow down the scope of an issue. The troubleshooting efforts focus on the components and systems related to a particular signal.
Careful investigation and troubleshooting are needed, tracing and sequencing the logs generated from various sources. For example, when incoming requests begin to fail, it is easy to blame the application source code, but troubleshooting may reveal that the root cause lies in insufficient storage capacity or internal network access issues.
Use golden signals to define alerts that provide critical feedback quickly. Once the alerts are received, use the insights provided by golden signals to troubleshoot the underlying issue.
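The relationship between golden signals and alerts can be sketched as threshold rules evaluated against current metrics; the rule names and thresholds here are hypothetical:

```python
# Hypothetical alert rules tied to golden signals
ALERT_RULES = [
    {"signal": "latency_p99_ms", "threshold": 500,  "severity": "warning"},
    {"signal": "error_rate",     "threshold": 0.05, "severity": "critical"},
    {"signal": "saturation",     "threshold": 0.9,  "severity": "warning"},
]

def evaluate(metrics, rules):
    """Return (signal, severity) for every rule whose threshold is breached."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["signal"])
        if value is not None and value > rule["threshold"]:
            fired.append((rule["signal"], rule["severity"]))
    return fired

# Latency and saturation alerts fire; the error rate is still within bounds
print(evaluate({"latency_p99_ms": 620, "error_rate": 0.01, "saturation": 0.95},
               ALERT_RULES))
```

Each fired alert names a signal, which narrows the troubleshooting scope to the components behind that signal.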
Isolate logs and alerts based on business services
A microservice-based architecture implements multiple independent software components designed to perform individual tasks. Additionally, there are copies/replicas of these containerized components, which are short-lived in nature.
Capturing the application logs generated from microservices is easy. However, isolating the logs to map them back to the delivered business service is tricky, especially when multiple microservices supporting multiple business services generate logs simultaneously. It is important to embed identifiers in the logs so they can be filtered down to the service undergoing troubleshooting.
Sometimes these filters also depend on how the application source code generates the log messages. Application logs, system logs, database access logs, network logs, and logs from any other monitored components are often stored in the same location. Carefully configuring these filter keys in the logging source code and various system components helps with isolating and tracing underlying issues.
One tip is to use tags wherever possible.
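One way to make logs filterable by business service is to emit structured (JSON) log lines tagged with service identifiers. The field names below (`service`, `business_service`, `trace_id`) are illustrative, not a standard:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def log_event(message, **tags):
    """Emit one JSON log line carrying the message plus identifying tags."""
    line = json.dumps({"message": message, **tags})
    log.info(line)
    return line

# Every log line can now be filtered by service or business_service downstream
log_event("payment authorized",
          service="payment-api",
          business_service="checkout",
          trace_id="abc123")
```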
Implement precise and responsible logging
As the number of components grows, log volume grows with it. This generates a lot of data and creates a couple of issues: It increases storage costs and makes it difficult to query and filter the logs to trace a buggy event.
It's important to generate meaningful logs to use logging and monitoring systems efficiently. Logs that are redundant or provide little value should be avoided or folded into the existing logging schema. Consolidating as many relevant attributes as possible into a single log statement can provide all the needed information with less tracing effort.
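Consolidating related attributes into one structured log statement, rather than scattering them across several, can be sketched as follows (field names are illustrative):

```python
import json

def consolidated_log(request_id, user_id, action, duration_ms, status):
    """One log line carrying everything needed to trace the event..."""
    return json.dumps({
        "request_id": request_id,
        "user_id": user_id,
        "action": action,
        "duration_ms": duration_ms,
        "status": status,
    })

# ...instead of three separate statements like "request received",
# "user resolved", and "request completed".
print(consolidated_log("r-42", "u-7", "upload", 183, "ok"))
```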
Use cold-warm-hot storage to manage costs
Keeping all the logs in a single database can become expensive, especially as the rate of log message generation increases over time. Implementing cold-warm-hot storage tiers helps reduce storage costs. For example, consider storing the logs generated in the last hour in a database where queries execute quickly (hot). Then retire them to a cheaper database or file storage solution (warm) for the next 24 hours, where the logs are still available on demand but with reduced query performance. Finally, move them to a cold storage solution, where logs that are rarely accessed can be archived cheaply.
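The one-hour/24-hour tiering described above can be sketched as a simple placement policy based on log age:

```python
from datetime import datetime, timedelta, timezone

def storage_tier(log_time, now):
    """Pick a storage tier by log age: 1 hour hot, 24 hours warm, then cold."""
    age = now - log_time
    if age <= timedelta(hours=1):
        return "hot"   # fast, expensive, fully queryable database
    if age <= timedelta(hours=24):
        return "warm"  # cheaper storage, slower on-demand queries
    return "cold"      # archive for infrequent access

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(minutes=30), now))
print(storage_tier(now - timedelta(hours=5), now))
print(storage_tier(now - timedelta(days=3), now))
```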
Moderate the request-to-log ratio
To predict the resources needed to implement logging and monitoring solutions, it is recommended to adhere to a specific request-to-log ratio, which indicates the number of log messages a single request generates when a success or failure path is executed.
Moderating this ratio helps strategize and quantify the logging and monitoring solutions.
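A back-of-the-envelope sizing based on the request-to-log ratio might look like this; the per-path log counts and traffic figures are illustrative assumptions, not measurements:

```python
# Assumed log lines emitted per request on each code path
logs_per_request = {"success": 3, "failure": 8}
expected_rpm = 1200       # assumed peak traffic in requests per minute
failure_fraction = 0.02   # assumed share of requests that fail

# Weighted average of the two paths, scaled by traffic
logs_per_minute = expected_rpm * (
    (1 - failure_fraction) * logs_per_request["success"]
    + failure_fraction * logs_per_request["failure"]
)
print(logs_per_minute)  # expected log volume per minute, for pipeline sizing
```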
Various self-managed and managed solutions are available in the market to monitor, track, and visualize metrics like the golden signals. Typically, the following are the features most desired in monitoring tools that support SRE principles:
- A real-time health dashboard representing the status of the four golden signals and additional parameters, such as Grafana's RED (rate, errors, duration) dashboards
- Capture and query application and system logs for root cause analysis (RCA)
- Various levels of logging (ERR, INFO, WARN, and DEBUG)
- The ability to generate reports and trends based on historical data
- Intuitive visualization tools to pinpoint probable causes of failure
- Integration with on-premises and cloud infrastructure components
- Integration with third-party SaaS solutions
- The ability to set customized alerts
- Support for the organization’s incident management process
Self-hosted (open source) solutions offer budget-friendly tooling options. However, these solutions incur infrastructure and maintenance costs because they are usually hosted on-premises. Some options for self-hosted monitoring solutions include Prometheus, Zabbix, Nagios, Cacti, and Sensu.
If budget is not a constraint and convenience is the priority, organizations can opt for managed monitoring solutions like SolarWinds, Datadog, AppDynamics, Honeycomb, or Pingdom.
Note that some of the open source tools described above also offer managed services.
In this article, we explained what golden signals are and their role in the SRE domain. As systems and architectural patterns evolve, they give rise to the need for better monitoring solutions. Organizations that deliver services with complex internal architectures need to keep an eye on these golden signals for faster response and resolution. This is where visualization tools come into the picture, and we presented various self-managed and managed options.
It is also important to configure the desired monitoring solutions to align them with business services, depending on the internal setup and component architecture. When the sources, metrics, and parameters are aligned, this adds more meaning to the golden signals and other key metrics.