Correlation in Latency Analysis
This article was my response to Amazon’s writing assessment when I was interviewed. I answered the question of “What is the most inventive or innovative thing you’ve done? It doesn’t have to be something that’s patented. It could be a process change, product idea, a new metric or customer facing interface — something that was your idea… [retracted]
At Google, our systems are complex and large. Our products rely on many microservices, caches, databases, storage systems and networking infrastructure. In the critical path, a user request may touch up to 100+ components. These components have different deployment and maintenance cycles. Most of our services are built and maintained by different teams. Teams rarely have full insight into each other. Measuring latency, identifying latency issues and being able to debug them effectively is an important capability for the reliability of our systems. Producing actionable telemetry data and improving its navigability have been a critical task for my team. I helped to invent new approaches to correlate telemetry signals (exemplars, correlation between tracing and logging, profiler labels) that helped our engineers to navigate latency problems faster.
We define availability as Service Level Objectives (SLOs). Latency is a critical element in our SLOs because we describe availability based on customer experience. We measure latency by collecting latency metrics from services. Metrics help us identify SLO violations, but they are not enough for debugging. Once a team receives an alert on a latency issue, they are often relying on additional telemetry signals to further narrow down the source of the problem. Beside metrics, we collect distributed traces, event logs and occasionally other runtime profiles. The additional signals are critical for our engineers to gather more context about the problem. Each signal is useful to identify a category of problems. For example, unexpected numbers of retries are visible in distributed traces, a CPU-heavy library call is visible in our continuous profiling dashboards, poor runtime scheduling is visible in runtime event logs.
Until recently, it was hard to narrow down the problem by filtering additional signals during an incident. I looked at our surveys to see only 1X% of our engineers said they can find relevant distributed traces easily. We collect tens of millions of traces per second globally, looking for a trace that may have clues about a service’s SLO violation is looking for a needle in a haystack. Additionally, we use different dashboards and visualization tools for metrics, traces, logs, and profiles. Our engineers have to learn the filtering capabilities of 4 or 5 tools in order to search for data relevant to incidents. It is a long and error-prone process. We collect metrics for every user request, but other signals (such as traces, logs, profiles, etc) are downsampled or are optional. It’s sometimes a couple of minutes of work just to see if there had been any additional signals collected for a case.
If an SLO violation happens, our engineers get alerts. Upon receiving an alert, they initially land at the metrics dashboards. Our approach has fundamentally changed how engineers navigate among dashboards. We redesigned our metric collection library to collect a reference to the current trace (if there is a trace in the current context) as it’s collecting a latency metric. This allowed our monitoring dashboards to display latency buckets with exemplar traces. It allowed our engineers to see example traces for each latency bucket on the monitoring dashboards and navigate to traces without friction. Secondly, we started to correlate event logs with traces. It became possible to navigate from a trace to logs with a single click. The final improvement was to introduce the correlation between traces and runtime profiles. In production, we sometimes enable runtime execution tracers to debug language runtimes (e.g. unexpected garbage collector pauses). Runtime events are expensive to collect and runtimes produce a lot of events. If this type of collection is needed, we enable collection very briefly and then disable. Even though the collection is brief, collected data can be hundreds of thousands of events. By linking runtime trace events to distributed traces, we made it possible to navigate from distributed traces to runtime events with a single click.
Our standard context propagation library played a significant role making this work possible. By propagating the trace headers in the context all through the stack, we enabled all the other instrumentation libraries to recognize the current trace and correlate. It was a large effort to land the rest of the work — an effort of 10+ teams. I led various aspects of coordinating and designing instrumentation library changes, contributing to the data format and service APIs, giving feedback to UX improvements, implementing similar capabilities in our open source stack (e.g. OpenCensus) and playing a role where I advocated for correlation widely in the industry with contributions to several open source projects.
Our approach to correlate instrumentation signals changed the way we instrument and present data in latency analysis. We made it possible for our engineers to navigate among different telemetry signals without losing context. It now takes several seconds to navigate among telemetry signals that once required minutes. These capabilities also helped us to be able to focus and debug certain customer cases more easily.
Beside improving incident response, the concept of correlation became a fundamental approach. When it comes to instrumentation, one size doesn’t fit all. When product teams need custom telemetry signals, they correlate their custom signals with the canonical ones. Creating awareness of the availability of the custom signals is less effort now. Correlations also improved the time it takes to hand off systems because it takes less time to onboard new members and teach them how to debug our systems end-to-end. Our work has influenced other instrumentation standards and libraries. I worked closely with the industry where they can benefit from similar concepts. Recently, our approach is adopted by the OpenMetrics initiated by Prometheus, OpenTelemetry, and Go’s execution tracer. Go 1.11 is released with capabilities to correlate runtime events with distributed traces.
With minimal changes to existing instrumentation libraries and respecting custom needs, our debugging capabilities and time-to-debug are improved. Today, besides being a fundamental tool in debugging, correlations play a role as a training tool and shortening the time to onboard new engineers.