Microservices Observability

Also published at https://jbd.dev/microservices-instrumentation/.

What makes microservices observability different from observability of monolithic systems?

Observability covers the activities of measuring, collecting, and analyzing diagnostic signals from a system. These signals may include metrics, traces, logs, events, profiles, and more.

In monolithic systems, the scope is a single process. A user request comes in and goes out from a service. Diagnostic data is collected in the scope of a single process.

When we first started to build large-scale microservices systems, some practices from monolithic systems didn’t scale. A typical problem is collecting signals across services and being able to tell whether the SLOs between them are being met. Unlike in monolithic systems, nothing is owned end-to-end by a single team. Different teams build different parts of the system and agree to meet an SLO.

For example, the authentication service depends on the datastore and asks the datastore team to meet a certain SLO. Similarly, the reporting service depends on the datastore and indexing services, and the indexing service depends on the datastore.

In microservices architectures, some services are likely to become a common dependency for many teams. Authentication and storage are examples: everyone needs them and ends up depending on them. At the same time, what each caller expects from a service varies. Authentication and indexing services might have wildly different requirements from the datastore service. The datastore service needs to understand the individual impact of each of these services.

This is why we created a concept that allows us to add more dimensions to the collected data. We call them tags. Tags are key/value pairs we attach to the recorded signal. Some example tags are the RPC name and the originator service name. Tags are the dimensions you want to break down your observability data by.

Once we collect the diagnostic signals with enough dimensions, we can create interesting analysis reports and alerts. Some examples:

  • Give me the datastore request latency for RPCs originated at the auth service.
  • Give me the traces for rpc.method = “datastore.Query”.

The datastore service is decoupled from the other services and doesn’t know much about them. So it is not possible for the datastore service to add fine-grained tags that reflect all the dimensions users want to see when they break down the collected diagnostic data.

The solution to this problem is to produce the tags at the upper-level services that call into the lower-level services. After producing these tags, we propagate them on the wire as part of the RPC. The datastore service can enrich the incoming tags with additional information it already knows (such as the incoming RPC name) and record the diagnostic data with the enriched tags.

Context is the general concept for propagating key/values among the services in distributed systems. Some languages, like Go, even have a common context type in their standard libraries. We use the common context type to propagate the diagnostics-related key/values in our systems. The receiving endpoint extracts the tags and keeps propagating them if it needs to make more RPCs in order to respond to the incoming call.

This is how we get fine-grained dimensions at the lower ends of the stack, regardless of how many layers of services a call passes through before it reaches the lowest end.
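The enrichment step on the receiving side can be sketched as a merge of two tag sets. The Enrich helper and the tag keys below are hypothetical. Letting the locally known value win on a key conflict is an assumption of this sketch, not something the approach prescribes.

```go
package main

import "fmt"

// Tags are the key/value dimensions attached to a recorded signal.
type Tags map[string]string

// Enrich returns a new tag set containing the incoming (propagated) tags
// plus the tags the receiving service already knows. The inputs are not
// modified. On a key conflict the local value wins (an assumption).
func Enrich(incoming, local Tags) Tags {
	out := make(Tags, len(incoming)+len(local))
	for k, v := range incoming {
		out[k] = v
	}
	for k, v := range local {
		out[k] = v
	}
	return out
}

func main() {
	// Tags produced upstream and received on the wire with the RPC.
	incoming := Tags{"originator": "auth"}

	// Information the datastore service knows by itself.
	local := Tags{"rpc.method": "datastore.Query"}

	// The diagnostic data is recorded with the enriched tag set.
	enriched := Enrich(incoming, local)
	fmt.Println(enriched["originator"], enriched["rpc.method"])
}
```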

We can then use observability signals to answer some of the critical questions such as:

  • Is the datastore team meeting their SLOs for service X?
  • What’s the impact of service X on the datastore service?
  • How much do we need to scale up a service if service X grows 10%?

Propagating tags and collecting the diagnostic data opens up a lot of new ways to consume observability signals in distributed systems. This is how your teams understand the impact of services all across the stack even if they don’t know much about the internals of each other’s services.

See rakyll.org for more.