Three Ways to Trace End-to-end

Having end-to-end distributed traces is a huge challenge for any project. In distributed tracing, end-to-end tracing is a term often used to refer to traces that capture most components in a critical path. Imagine an HTTP request made to trigger a Lambda function. Being able to see the HTTP client, the load balancing and scheduling decisions, and any outgoing requests made from the Lambda function to serve the request would be “more end-to-end” than just capturing a trace with a function invocation. End-to-end traces are critical because they show you the bigger picture and how your systems interact with others. They make it easier to diagnose issues and to highlight service dependencies, inefficient scheduling or execution patterns, and more.

On the left, we see a trace only with client and server side spans for a Lambda function trigger. On the right, we see more spans including ALB, and outgoing requests to S3 and to a Redis server.

There are two main prerequisites to have end-to-end traces:

The challenge exists because not every component is instrumented to produce distributed traces, and not every component understands distributed tracing context/headers (e.g. B3, W3C TraceContext, X-Amzn-Trace-Id). Even if they do, they may be using different header standards or produce vendor-specific trace data that is hard or impossible to merge. Especially with legacy systems where you can’t modify the source code, or with components without any established context propagation and middleware support, adding tracing or modifying the existing work can be a tremendously expensive investment.

Transforming the trace context

One of the common difficulties in distributed tracing is the lack of a “standard” propagation header. The community has established options like B3 and W3C TraceContext, alongside various vendor-specific headers like X-Amzn-Trace-Id. Projects that are already instrumented to parse one of these formats are completely unaware of the others.

If service A only understands W3C TraceContext, and service B only understands B3, service B will drop the incoming trace context. One intermediate option would be converting the W3C TraceContext header to B3 before service B receives the incoming request. If service B is not the ultimate downstream service, or there are other services that require a non-B3 header, you may need to convert the headers back and forth.
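As a rough illustration of the conversion described above, here is a minimal Python sketch that a proxy or middleware sitting between service A and service B could run. The function names are hypothetical and not part of any library. It assumes the version 00 `traceparent` layout from W3C TraceContext and the multi-header B3 form (X-B3-TraceId, X-B3-SpanId, X-B3-Sampled), and it assumes headers arrive in a plain dict with exactly that casing.

```python
import re

# W3C traceparent, version 00: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")
_HEX_RE = re.compile(r"[0-9a-f]+")


def traceparent_to_b3(traceparent):
    """Convert a W3C traceparent value to B3 multi-headers, or None if invalid."""
    match = _TRACEPARENT_RE.fullmatch(traceparent.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in W3C TraceContext.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    sampled = "1" if int(flags, 16) & 0x01 else "0"
    return {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": sampled,
    }


def b3_to_traceparent(headers):
    """Convert B3 multi-headers back to a W3C traceparent value, or None."""
    trace_id = headers.get("X-B3-TraceId", "").lower()
    span_id = headers.get("X-B3-SpanId", "").lower()
    # B3 allows 64-bit trace IDs; left-pad them to the 128 bits W3C expects.
    if len(trace_id) == 16:
        trace_id = trace_id.rjust(32, "0")
    if len(trace_id) != 32 or len(span_id) != 16:
        return None
    if not (_HEX_RE.fullmatch(trace_id) and _HEX_RE.fullmatch(span_id)):
        return None
    flags = "01" if headers.get("X-B3-Sampled") == "1" else "00"
    return "00-{}-{}-{}".format(trace_id, span_id, flags)
```

The conversion is lossy in both directions: this sketch ignores the W3C `tracestate` header as well as the B3 debug flag and parent span ID, so a real deployment would need to decide how to carry those along.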

Pros

Cons

Linking traces

In distributed tracing, linking is a way to associate two or more traces with each other. Even though not all the data is captured under the same trace, when you link traces, you can still navigate from one to the other.

A trace collected for serviceA.Lookup makes a request to serviceB.Query. The client span links to another trace that contains serviceB.Query’s server-side spans.

Assume serviceA and serviceB are instrumented differently and don’t export their tracing data to the same tracing system. In this case, they will have split traces and we can manually link these two traces to make it easier to navigate between them. If serviceB.Query returns its trace ID in the response, the client span of serviceB.Query can record it as a span annotation. Then when browsing the serviceA.Lookup trace, it would be possible to navigate to the linked serviceB.Query trace.
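The manual linking described above could look roughly like the following Python sketch. Everything named here is a hypothetical stand-in: the `Span` class is a toy substitute for whatever span type your tracing library provides, and the `X-Trace-Id` response header and the trace URL template are assumptions about what serviceB and its tracing backend expose.

```python
from dataclasses import dataclass, field

# Hypothetical response header in which serviceB returns its own trace ID.
LINK_HEADER = "X-Trace-Id"
# Hypothetical URL template for serviceB's tracing backend.
SERVICE_B_TRACE_URL = "https://tracing-b.example.com/trace/{trace_id}"


@dataclass
class Span:
    """Toy stand-in for a tracing library's span type."""
    name: str
    trace_id: str
    annotations: dict = field(default_factory=dict)


def record_linked_trace(span, response_headers):
    """Annotate the client span with the trace ID returned by the callee.

    Returns True if a link was recorded, False if the response carried none.
    """
    linked_trace_id = response_headers.get(LINK_HEADER)
    if not linked_trace_id:
        return False
    span.annotations["linked_trace_id"] = linked_trace_id
    span.annotations["linked_trace_url"] = SERVICE_B_TRACE_URL.format(
        trace_id=linked_trace_id
    )
    return True


# Usage: the client span for the serviceB.Query call, inside serviceA.Lookup's trace.
client_span = Span(name="serviceB.Query", trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
record_linked_trace(client_span, {"X-Trace-Id": "a3ce929d0e0e4736"})
```

Storing a ready-made URL next to the raw ID is a convenience so that someone browsing the serviceA.Lookup trace can jump straight to the other tracing system.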

Pros

Cons

Partial traces

One reasonable option would be giving up on “end-to-end” and focusing only on collecting partial traces from the services your group owns and maintains. Even though this is technically not a way to achieve “end-to-end”, it may help the organization understand the benefits of having distributed tracing. Even without end-to-end traces, it’s still useful to see traces from a specific component to debug and identify issues. This provides a bottom-up approach where a team can communicate the value of distributed traces to the larger organization without having to convince the entire organization to invest in it.

Even though, in an ideal world, we would like to have as many end-to-end traces as possible, it’s not always easy or feasible to have them. These workarounds may be temporary patches, but they can help you get the most out of distributed tracing.

See rakyll.org for more.
