Correlation in Latency Analysis

Challenges

We define availability in terms of Service Level Objectives (SLOs). Latency is a critical element of our SLOs because we describe availability based on customer experience. We measure latency by collecting latency metrics from our services. Metrics help us identify SLO violations, but they are not enough for debugging. Once a team receives an alert about a latency issue, they often rely on additional telemetry signals to narrow down the source of the problem. Besides metrics, we collect distributed traces, event logs and, occasionally, runtime profiles. These additional signals are critical for our engineers to gather more context about the problem, and each signal is useful for identifying a particular category of problems. For example, an unexpected number of retries is visible in distributed traces, a CPU-heavy library call is visible in our continuous profiling dashboards, and poor runtime scheduling is visible in runtime event logs.

Correlation

When an SLO violation happens, our engineers get an alert. Upon receiving it, they initially land on the metrics dashboards. Our approach has fundamentally changed how engineers navigate from those dashboards to the other signals.

First, we redesigned our metric collection library to capture a reference to the current trace (if there is a trace in the current context) whenever it records a latency metric. This allows our monitoring dashboards to display latency buckets with exemplar traces: engineers can see example traces for each latency bucket and navigate to them without friction.

Second, we started to correlate event logs with traces. It became possible to navigate from a trace to its logs with a single click.

The final improvement was to correlate traces with runtime profiles. In production, we sometimes enable runtime execution tracers to debug language runtimes (e.g. unexpected garbage collector pauses). Runtime events are expensive to collect, and runtimes produce a lot of them. When this type of collection is needed, we enable it very briefly and then disable it. Even though the collection is brief, the collected data can amount to hundreds of thousands of events. By linking runtime trace events to distributed traces, we made it possible to navigate from a distributed trace to the relevant runtime events with a single click.
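The first two correlations can be illustrated with a minimal sketch. It is not our production library: `current_trace_id`, `LatencyHistogram`, `LogIndex` and `handle_request` are hypothetical names, the trace context is simulated with a `ContextVar`, and keeping only the most recent trace per bucket is an assumed exemplar policy chosen for brevity.

```python
import bisect
from collections import defaultdict
from contextvars import ContextVar
from typing import Optional

# Hypothetical stand-in for a tracing library's "current trace" context.
current_trace_id: ContextVar[Optional[str]] = ContextVar(
    "current_trace_id", default=None
)


class LatencyHistogram:
    """Latency histogram that keeps one exemplar trace ID per bucket."""

    def __init__(self, bounds_ms):
        self.bounds_ms = sorted(bounds_ms)  # inclusive upper bounds
        # One extra slot at the end acts as the overflow bucket.
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.exemplars = [None] * (len(self.bounds_ms) + 1)

    def record(self, latency_ms):
        # bisect_left places a value equal to a bound in that bound's bucket.
        i = bisect.bisect_left(self.bounds_ms, latency_ms)
        self.counts[i] += 1
        trace_id = current_trace_id.get()
        if trace_id is not None:
            # Assumed policy: the most recent trace replaces the exemplar.
            self.exemplars[i] = trace_id


class LogIndex:
    """Event log store indexed by the trace that was active when logging."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def log(self, message):
        self._by_trace[current_trace_id.get()].append(message)

    def for_trace(self, trace_id):
        return list(self._by_trace.get(trace_id, []))


def handle_request(hist, logs, trace_id, latency_ms):
    """Simulate one traced request that emits a log line and a latency metric."""
    token = current_trace_id.set(trace_id)
    try:
        logs.log(f"handled request in {latency_ms} ms")
        hist.record(latency_ms)
    finally:
        current_trace_id.reset(token)
```

With this shape, a dashboard can render each bucket's count next to its exemplar trace ID, and the trace view can call `for_trace` to fetch the matching logs. Both lookups use the same trace ID, which is what makes the navigation a single click.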

Results

Correlating instrumentation signals changed the way we instrument our services and present data for latency analysis. Our engineers can now navigate among different telemetry signals without losing context. Moving from one signal to another, which once took minutes, now takes seconds. These capabilities also made it easier for us to focus on and debug specific customer cases.
