Three Ways to Trace End-to-end

Having end-to-end distributed traces is a huge challenge for any project. In distributed tracing, “end-to-end” usually refers to traces that capture most components in a critical path. Imagine an HTTP request made to trigger a Lambda function. Being able to see the HTTP client, the load balancing and scheduling decisions, and any outgoing requests made from the Lambda function to serve the request would be “more end-to-end” than just capturing a trace with a function invocation. End-to-end traces are critical because they show you the bigger picture and how your systems interact with others. They make it easier to diagnose issues and to highlight service dependencies, inefficient scheduling or execution patterns, and more.

On the left, we see a trace only with client and server side spans for a Lambda function trigger. On the right, we see more spans including ALB, and outgoing requests to S3 and to a Redis server.

There are two main prerequisites to have end-to-end traces:

  • Being able to accept and/or propagate the distributed tracing context
  • Participating in the incoming trace and producing spans
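As a rough illustration of these two prerequisites, here is a minimal sketch of a service that accepts a W3C TraceContext header, joins the incoming trace with a new span, and propagates the context on outgoing requests. The names `SpanContext`, `extract`, `start_span` and `inject` are made up for this example and are not taken from any tracing library. The sketch also assumes headers are stored in a plain dict with lowercase keys and that a service with no incoming context samples every new trace.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

HEX = set("0123456789abcdef")


@dataclass
class SpanContext:
    trace_id: str                    # 32 lowercase hex characters
    span_id: str                     # 16 lowercase hex characters
    sampled: bool
    parent_id: Optional[str] = None  # span ID of the caller, if any


def _valid_id(value: str, length: int) -> bool:
    # IDs must be lowercase hex of the expected length and not all zeros.
    return len(value) == length and set(value) <= HEX and set(value) != {"0"}


def extract(headers: dict) -> Optional[SpanContext]:
    """Accept the context: parse a version-00 traceparent header, or return None."""
    parts = headers.get("traceparent", "").strip().split("-")
    if len(parts) != 4 or parts[0] != "00":
        return None
    _, trace_id, span_id, flags = parts
    if not (_valid_id(trace_id, 32) and _valid_id(span_id, 16)):
        return None
    if len(flags) != 2 or not set(flags) <= HEX:
        return None
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def start_span(parent: Optional[SpanContext]) -> SpanContext:
    """Participate: reuse the caller's trace ID, or start a new trace."""
    if parent is None:
        # Assumption for this sketch: new root traces are always sampled.
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled=True)
    return SpanContext(parent.trace_id, secrets.token_hex(8),
                       parent.sampled, parent_id=parent.span_id)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Propagate: write the current span's context into outgoing headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"
```

A request handler would call `extract` on the incoming headers, `start_span` for its own work, and `inject` before each outgoing call. A component that skips any of these steps is where the trace gets cut.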

The challenge exists because not every component is instrumented to produce distributed traces, and not every component understands distributed tracing context/headers (e.g. B3, W3C TraceContext, X-Amzn-Trace-Id). Even if they do, they may use different header standards or produce vendor-specific trace data that is hard or impossible to merge. Especially with legacy systems where you can’t modify the source code, or with components that lack established context propagation and middleware support, adding tracing or modifying the existing work can be a tremendously expensive investment.

One of the common difficulties in distributed tracing is the lack of a “standard” propagation header. The community has established options like B3 and W3C TraceContext, alongside various vendor-specific headers like X-Amzn-Trace-Id. A project that is already instrumented to parse one of these formats is usually unaware of the others.

If service A only understands W3C TraceContext, and service B only understands B3, service B will not recognize the incoming context and will drop the incoming trace. One intermediate option would be converting the W3C TraceContext header to B3 before service B receives the incoming request. If service B is not the ultimate downstream service, or there are other services that require a non-B3 header, you may need to convert the headers back and forth.
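To make the conversion concrete, here is a minimal sketch of the W3C TraceContext to B3 direction, as a proxy or middleware might apply it. The function name `w3c_to_b3` is made up for this example. It assumes headers live in a plain dict, that the incoming `traceparent` is version 00, and that the downstream service reads the multi-header B3 form (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-Sampled`). Anything carried in `tracestate` has no B3 equivalent and is not converted.

```python
import re

# Version-00 traceparent: 00-<32 hex trace ID>-<16 hex span ID>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def w3c_to_b3(headers: dict) -> dict:
    """Return a copy of headers with B3 headers added when traceparent is valid."""
    out = dict(headers)
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", "").strip())
    if match is None:
        return out  # nothing to convert; pass the request through unchanged
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return out  # all-zero IDs are invalid in W3C TraceContext
    # The lowest bit of the flags byte carries the sampling decision.
    sampled = int(flags, 16) & 0x01
    out["X-B3-TraceId"] = trace_id
    out["X-B3-SpanId"] = span_id
    out["X-B3-Sampled"] = "1" if sampled else "0"
    return out
```

The original `traceparent` header is left in place, so services further downstream that do understand W3C TraceContext can still read it.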

Pros

  • It is simple to implement if downstream services consistently support a different header.
  • If the converter is implemented as a proxy server, it doesn’t require any changes to the existing services.
  • When combined with a vendor-agnostic collection pipeline like the OpenTelemetry Collector, services can keep publishing trace spans in vendor-specific formats, and the collector can transform and send them to the same backend. For example, if service A is using Jaeger and B is using Zipkin, the OpenTelemetry Collector can accept spans coming from both services and transform all the data to be sent to either Jaeger or Zipkin.

Cons

  • If trace headers are not convertible due to reasons like different identifier lengths or different fields, this option is not possible to implement.
  • If downstream services support a variety of different headers, this option is very complicated to implement and maintain.
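The identifier-length problem can be seen in a small sketch of the B3 to W3C TraceContext direction. B3 allows 64-bit trace IDs (16 hex characters) while W3C TraceContext requires 128-bit ones (32 hex characters). Widening can be done by left-padding with zeros, but narrowing a 128-bit ID for a system that only accepts 64 bits loses information unless the upper half is already zero. The function names below are made up for this example, and the handling of `X-B3-Sampled` and `X-B3-Flags` is simplified.

```python
import re
from typing import Optional

_HEX16 = re.compile(r"[0-9a-f]{16}")
_HEX16_OR_32 = re.compile(r"[0-9a-f]{16}|[0-9a-f]{32}")


def b3_to_traceparent(headers: dict) -> Optional[str]:
    """Build a version-00 traceparent value from multi-header B3, or None."""
    trace_id = headers.get("X-B3-TraceId", "").lower()
    span_id = headers.get("X-B3-SpanId", "").lower()
    if not _HEX16_OR_32.fullmatch(trace_id) or not _HEX16.fullmatch(span_id):
        return None
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid in W3C TraceContext
    # Widen a 64-bit B3 trace ID to 128 bits by left-padding with zeros.
    trace_id = trace_id.rjust(32, "0")
    # Simplification: the debug flag (X-B3-Flags: 1) is treated as sampled.
    sampled = (headers.get("X-B3-Sampled") in ("1", "true")
               or headers.get("X-B3-Flags") == "1")
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def narrow_trace_id(trace_id_128: str) -> Optional[str]:
    """Return the 64-bit form only when the conversion is lossless."""
    if trace_id_128[:16] == "0" * 16:
        return trace_id_128[16:]
    return None  # upper 64 bits are in use; this ID cannot be converted
```

When `narrow_trace_id` returns `None`, the converter has no faithful way to hand the trace to a 64-bit-only service, which is the case the first bullet above describes.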

In distributed tracing, linking is a way to associate two or more traces. Even though not all the data is captured under the same trace, when you link traces, you can still navigate from one to the other.

In a trace collected for serviceA.Lookup, the client makes a request to serviceB.Query. The client span links to another trace that contains serviceB.Query’s server-side spans.

Assume serviceA and serviceB are instrumented differently and don’t export their tracing data to the same tracing system. In this case, they will have split traces and we can manually link these two traces to make it easier to navigate between them. If serviceB.Query returns its trace ID in the response, the client span of serviceB.Query can record it as a span annotation. Then when browsing the serviceA.Lookup trace, it would be possible to navigate to the linked serviceB.Query trace.
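Here is a minimal sketch of the client side of that manual linking. Everything in it is hypothetical: the `X-Linked-Trace-Id` response header is a name invented for this example (serviceB would have to be changed to return it), and `Span` is a stand-in for whatever span type your tracing library provides.

```python
from dataclasses import dataclass, field

# Hypothetical response header that serviceB uses to return its trace ID.
LINK_HEADER = "X-Linked-Trace-Id"


@dataclass
class Span:
    """Stand-in for a tracing library's span; only what the sketch needs."""
    name: str
    trace_id: str
    annotations: dict = field(default_factory=dict)


def record_linked_trace(client_span: Span, response_headers: dict) -> bool:
    """Annotate the client span with the trace ID returned by the callee.

    Returns True if a link was recorded, False if the callee sent none.
    """
    linked = response_headers.get(LINK_HEADER)
    if not linked:
        return False
    client_span.annotations["linked_trace_id"] = linked
    return True
```

serviceA would call `record_linked_trace` on the client span of serviceB.Query once the response arrives. A trace viewer, or a small custom tool, can then turn the `linked_trace_id` annotation into a pointer to the trace in serviceB's tracing system.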

Pros

  • It doesn’t require any significant changes to the existing instrumentation and propagation formats.
  • Allows different data access levels. For example, sensitive traces can be split as a linked trace with different access levels.

Cons

  • Requires custom ways to propagate back the “linked trace” and record it.
  • Not many distributed tracing tools support visualizing or querying links. Even when querying is supported, it is very limited in comparison to having all the spans under the same trace.
  • It is hard to produce automatic service maps based on trace data; you may need to build custom solutions.
  • Propagating the sampling decision is still a challenge unless all components are using the same trace header.

One reasonable option is to give up on “end-to-end” and focus only on collecting partial traces from the services your group owns and maintains. Even though this is technically not a way to achieve “end-to-end”, it may help organizations understand the benefits of having distributed tracing. Even without end-to-end traces, it’s still useful to see traces from a specific component to debug and identify issues. This provides a bottom-up approach where a team can communicate the value of distributed traces to the larger organization without having to convince the entire organization to invest in it.

Even though, in an ideal world, we would like to have as many end-to-end traces as possible, it’s not always easy or feasible to have them. These workarounds may be temporary patches, but they can help you get the most out of distributed tracing.

See rakyll.org for more.