Debugging Latency in Go 1.11

Jaana Dogan
6 min read · May 4, 2018


Diagnosing and debugging complex systems is complicated. It often takes multiple levels of diagnostic data to understand the possible causes of latency issues.

A distributed system is made of many servers that depend on each other to serve a user request. At any time,

  • A process in the system might be handling a large number of requests.
  • In highly concurrent servers, there is no easy way to isolate the events that happened in the lifetime of a request.
  • In highly concurrent servers, we have no good visibility into the events that occurred to serve a request.

As Go has become a popular language for writing servers in recent years, we have realized the need to understand what’s going on in a Go process in the lifetime of a request.

Many runtime activities occur during program execution: scheduling, memory allocation, garbage collection and more. But it used to be impossible to associate user code with runtime events and to help users analyze how runtime events affected their performance.

SREs would ask for help diagnosing a latency issue, wondering if anyone could help them understand and optimize a specific server. Even for Go experts, it was quite complicated to estimate how the runtime might be impacting their specific situation.

There is no easy way to tell why latency is high for certain requests. Distributed traces allow us to target which request handler to look at, but we need to dig deeper. Could it be due to GC, the scheduler or I/O that we waited so long to serve a response? — SRE

From the perspective of someone external to your system, when a latency issue is experienced, they don’t know much apart from the fact that they waited longer than expected (356 ms) for a response.

User GETs /messages and it takes 356 ms to have a response back.

From the perspective of engineers who have access to distributed traces, it is possible to see the breakdown and how each service contributes to the overall latency. With distributed traces, we have more visibility into the situation.

See the distributed traces you collect to see the breakdown.

In this case, in order to serve /messages, we made three internal RPCs: auth.AccessToken, cache.Lookup and spanner.Query. We can see how each RPC contributes to the latency. At this point, we can spot that auth.AccessToken took longer than usual.

We have successfully scoped the problem down to a specific service. We can find the source of the latency in auth.AccessToken by associating it with a specific process, or we can randomly check whether the issue is reproducible on any auth service instance.

Go 1.11 will add support to the execution tracer for targeting runtime events by RPC calls. With the new capabilities, users will be able to gather more information about what happened in the lifetime of a call.

In this case, we are focusing on a section of the auth.AccessToken span. We spent 30 + 18 µs in total on networking, 5 µs on a blocking syscall, 8 µs on garbage collection and 123 µs on actually executing the handler, where we mostly spent time on serialization and deserialization.

By looking at this level of detail, we can finally say we were unfortunate that this RPC overlapped with GC, and that we spent an unexpectedly large amount of time on serialization/deserialization. Engineers can then point to recent changes in the auth.AccessToken message and fix the performance issue. They can also see whether GC often affects this request on the critical path, and optimize memory allocation to improve this pattern.

Go 1.11

With 1.11, the Go execution tracer will introduce a few new concepts, APIs and tracing capabilities:

  • User events and user annotations, see runtime/trace.
  • Association between user code and runtime.
  • Possibility to associate execution tracer data with distributed traces.

The execution tracer introduces two high-level concepts for users to instrument their code: regions and tasks.

Regions are sections of the code you want to collect tracing data for. A region starts and ends in the same goroutine. A task, on the other hand, is more of a logical group that categorizes related regions together. A task can end in a different goroutine than the one it started in.

We expect users to start an execution tracer task for each distributed trace span they have, instrument their RPC frameworks comprehensively by creating regions, enable execution tracer momentarily when there is a problem, record some data, and analyze the output.

Try it yourself

Even though the new capabilities will only be available in Go 1.11, you can give them a try by installing Go from tip.

I also recommend you give it a try with your distributed traces. I recently added execution tracer task support for the spans created by Census.

import (
	"runtime/trace"

	octrace "go.opencensus.io/trace"
)

ctx, span := octrace.StartSpan(ctx, "/messages")
defer span.End()

trace.WithRegion(ctx, "connection.init", conn.init)

If you are using the gRPC and HTTP integrations, you don’t have to manually create any spans because they are automatically created for you. In your handlers, you can simply use runtime/trace with the incoming context.

Register the pprof.Trace handler and collect data when you need execution tracer data for diagnosis:

import _ "net/http/pprof"

go func() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}()

Then, momentarily record data if you want execution tracer data and start the tool to visualize:

$ curl http://server:6060/debug/pprof/trace?seconds=5 -o trace.out
$ go tool trace trace.out
2018/05/04 10:39:59 Parsing trace...
2018/05/04 10:39:59 Splitting trace...
2018/05/04 10:39:59 Opening browser. Trace viewer is listening on

Then you can see a distribution of execution tracer tasks created for helloworld.Greeter.SayHello at /usertasks.

Duration distribution of RPC tasks.

You can click on the outlier latency bucket, the 3981 µs one, to further analyze what happened in the lifetime of that specific RPC.

/userregions also allows you to list the collected regions. You can see the connection.init region and the number of records. (Note that connection.init was manually added to the gRPC framework code for demo purposes; more work to instrument gRPC is still in progress.)

Duration distribution of regions.

If you click on any of the links, it will give you more details about the regions that fall into that latency bucket. In the following case, we see one region collected in the 1000 µs bucket.

1000µs region spending extra time to wait for the GC and the scheduler.

Then you can see the fine-grained breakdown of latency. You can see that the 1309 µs region overlapped with garbage collection, which added overhead on the critical path in terms of GC work and scheduling. Otherwise, executing the handler and handling the blocking syscalls took roughly the same amount of time.


Even though the new execution tracer capabilities are powerful, there are limitations.

  • Regions can only start and stop in the same goroutine. The execution tracer currently cannot automatically record data that crosses multiple goroutines, so regions have to be instrumented manually. The next big step will be adding more fine-grained instrumentation to RPC frameworks and standard packages such as net/http.
  • The output format of the execution tracer is hard to parse, and go tool trace is the only canonical tool that understands this format. There is no easy way to automatically attach execution tracer data to distributed traces — hence we collect them separately and correlate them later.


Go aims to be a great runtime for production services. With the data available from the execution tracer, we are one step closer to having high visibility into our production servers and to offering more actionable data when there is a problem.