Why is Distributed Tracing Broken?

What does it take to introduce distributed tracing into your system?

  • Sort out what distributed tracing is.
  • Sort out what you can trace end-to-end. There are always some legacy black holes that will drop your traces. If one of them is in your critical path, you need a way to go around the hole.
  • Sort out where to begin. Most users assume you start by instrumenting your code. Ideally, you should start with HTTP or RPC traces and only touch an instrumentation library if you need custom spans and annotations.
  • Sort out which propagation format you can use. Sort out which components of your system can understand that format.
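
To make "propagation format" concrete, here is a minimal sketch that uses the W3C Trace Context `traceparent` header as one example of such a format. The choice of that header is my assumption for illustration, and the names `SpanContext`, `parse_traceparent`, `format_traceparent` and `child_of` are hypothetical. The sketch only handles version `00` and ignores `tracestate`.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Layout of a version-00 `traceparent` header:
#   "00" "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    """Hypothetical container for the parts of the header we care about."""
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is unusable."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Render a context as the header value to send downstream."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def child_of(parent: Optional[SpanContext]) -> SpanContext:
    """Continue the caller's trace, or start a new one if there is none."""
    span_id = secrets.token_hex(8)
    if parent is None:
        # This is what happens behind a component that dropped the header.
        return SpanContext(secrets.token_hex(16), span_id, True)
    return SpanContext(parent.trace_id, span_id, parent.sampled)
```

Any component that can run these few lines can take part in a trace. A component that cannot parse the header ends up in the `parent is None` branch and starts a fresh trace, which is how the trace gets cut in two.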

What might the ideal distributed tracing world be?

  • There is a common propagation format and everyone understands each other’s traces. This eliminates the need for every system to link against the same vendor-specific libraries just to be compatible.
  • There is a common exposition format. I can export directly to backends or store the data myself, and I can consume it as is or preprocess it first. Once there is a common exposition format, there will be more tools that can use tracing data, not just the fully featured tracing backends.
  • Users get traces automatically from HTTP clients and frameworks and from RPC frameworks. Users don’t have to instrument anything unless they specifically need fine-grained spans or annotations.
  • No instrumentation library lock-in. We encourage an ecosystem with various instrumentation libraries that can generate traces in the common exposition format.
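
As a rough illustration of why a common exposition format matters, the sketch below writes spans as one JSON object per line and then consumes them with a tiny tool that is not a tracing backend. The field names and the functions `export` and `trace_durations_ns` are hypothetical; they are not taken from any existing format or library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Span:
    # Hypothetical field names, chosen only for this sketch.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ns: int
    end_ns: int


def export(spans: Iterable[Span]) -> str:
    """Producer side: serialize spans as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in spans)


def trace_durations_ns(exported: str) -> Dict[str, int]:
    """Consumer side: wall-clock duration of each trace, in nanoseconds.

    Only the agreed-upon fields are needed, not the library that
    produced them.
    """
    bounds: Dict[str, list] = {}
    for line in exported.splitlines():
        record = json.loads(line)
        first, last = bounds.setdefault(
            record["trace_id"], [record["start_ns"], record["end_ns"]]
        )
        bounds[record["trace_id"]] = [
            min(first, record["start_ns"]),
            max(last, record["end_ns"]),
        ]
    return {trace_id: last - first for trace_id, (first, last) in bounds.items()}
```

The producer and the consumer share nothing but the line format, so either side can be swapped out without touching the other.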

I’d prefer that we stop adding complexity and more incompatible solutions to the mix. We need a data format and a de facto way to identify traces, then a de facto way to propagate them on the wire and in-process.

Distributed tracing cannot be mainstream if we fail to achieve a basic level of compatibility with each other.

