Google’s Approach to Observability

A cross-stack framework

At Google, an average service is likely to depend on tens of other services at any given time. This makes it hard to find the origin of a problem that starts lower in the stack.

Components

Observability is a multidimensional problem. We currently provide stats collection and distributed tracing, but the framework can be extended with other components.

Instrumentation is on by default

Our philosophy is to make instrumentation so cheap that users don’t need to think twice about whether recording is on or not. Library authors also don’t have to care about it or provide configuration to turn it on or off. We provide a fast mechanism to record things and drop them immediately if you don’t need to export the data. Because the instrumentation is always present in the final binary, users can turn exporting on dynamically in production when there is a problem and see additional diagnostics data coming from their services to understand what is going on.
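To make the "record cheaply, drop unless exporting" idea concrete, here is a minimal sketch. The `Recorder` class and its method names are hypothetical and exist only for illustration; they are not the API of any Google library. The point is that the recording call is always compiled in, costs little when exporting is off, and can be switched on at runtime.

```python
import threading


class Recorder:
    """Hypothetical always-on recorder (illustrative, not a real library API)."""

    def __init__(self):
        self._exporting = False          # off by default: records are dropped
        self._lock = threading.Lock()
        self._buffer = []

    def set_exporting(self, enabled):
        # Flipped at runtime, e.g. by an operator during an incident.
        self._exporting = enabled

    def record(self, name, value):
        # Fast path: a single boolean check, then drop the data.
        if not self._exporting:
            return
        with self._lock:
            self._buffer.append((name, value))

    def drain(self):
        # Hand the buffered records to an exporter and start a fresh buffer.
        with self._lock:
            out, self._buffer = self._buffer, []
        return out


recorder = Recorder()
recorder.record("latency_ms", 12.0)      # dropped, exporting is off
recorder.set_exporting(True)             # turned on in production
recorder.record("latency_ms", 15.0)      # kept for export
```

Library code calls `record` unconditionally; only the process owner decides whether anything is exported.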

Aggregation of data

We keep it cheap by aggregating diagnostics data at the node, which reduces diagnostics data traffic. Many large-scale products saved a significant amount of resources once we started to aggregate data.
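As a rough illustration of node-level aggregation, the sketch below folds each recorded value into a per-measure summary so that only one small record per measure leaves the node at flush time. `Summary` and `NodeAggregator` are hypothetical names, and count/sum/min/max is just one assumed choice of aggregation; the article does not say which aggregations are used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Summary:
    """Running aggregate for one measure (hypothetical shape)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


class NodeAggregator:
    """Aggregates raw values in-process so only summaries are exported."""

    def __init__(self):
        self._summaries = defaultdict(Summary)

    def record(self, name, value):
        self._summaries[name].add(value)

    def flush(self):
        # Return one summary per measure and reset for the next interval.
        out = dict(self._summaries)
        self._summaries = defaultdict(Summary)
        return out


agg = NodeAggregator()
for latency in (10.0, 20.0, 30.0):
    agg.record("latency_ms", latency)
exported = agg.flush()   # one Summary instead of three raw data points
```

However many values are recorded between flushes, the exported payload stays the same size per measure.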

White box and black box

The benefit of agreeing on a common framework is that white box instrumentation is already baked in everywhere, in the same way, and that it fosters an ecosystem of integrations. Load balancers, RPC frameworks, networking services, etc. can easily auto-instrument. Because the underlying instrumentation framework is the same, it is easy to start the instrumentation at an integration point (e.g. a load balancer) and keep using the same library to add precise additional data.

The future of observability data

Observability data provides clear and precise information about usage and utilization. One of the most significant contributions the collected data can make is to help adaptive systems utilize their resources better. What else would be possible if load balancers and schedulers knew more about these highly precise diagnostics signals from the services they serve?
