Having end-to-end distributed traces is a huge challenge for any project. In distributed tracing, end-to-end tracing is a term often used to refer to traces that capture most components in a critical path. Imagine an HTTP request made to trigger a Lambda function. Being able to see the HTTP client, load balancing and scheduling decisions, and any outgoing requests made from the Lambda function to serve the request would be “more end-to-end” than just capturing a trace with a function invocation. End-to-end traces are critical because they show you the bigger picture and how your systems interact with others. …


Metric collection remains one of the hard problems. We’ve been collecting metrics for a very long time, so why is this still a hard problem in 2020? Our workloads are becoming larger and more sophisticated. In order to produce useful metric data, we are producing and collecting richer metric data. We are also more interested in understanding what else was in the context when the metric was collected. Was there a distributed trace or a log we can correlate? It’s important for us to effectively navigate among different telemetry signals during outages. We have a larger number of metadata…


My Saturday morning coffee is disrupted by this tweet:

“xooglers always be like ‘at google we…’ — devonbl

A month into my departure from Google, I can relate to this. It probably annoys me more than it annoys my current coworkers at AWS, because they know it’s OK to make comparisons when you are a new employee. I avoid all possible conflicts of interest. Then, I’m in meetings with open source contributors who are ex-Googlers and so many sentences start with “At Google, we did…” and I add a few more personal stories.

Let’s be fair. I always portrayed the good, the…


This article was my response to Amazon’s writing assessment when I was interviewed. I answered the question of “What is the most inventive or innovative thing you’ve done? It doesn’t have to be something that’s patented. It could be a process change, product idea, a new metric or customer facing interface — something that was your idea… [retracted]

At Google, our systems are complex and large. Our products rely on many microservices, caches, databases, storage systems and networking infrastructure. In the critical path, a user request may touch 100+ components. These components have different deployment and maintenance cycles…


Spanner is a distributed database that Google started building a while ago as a highly available and highly consistent database for its own workloads. Spanner was initially built as a key/value store; it was in a completely different shape than it is today and had different goals. Since the beginning, it has had transactional capabilities and external consistency, and has been able to fail over transparently. Over time, Spanner adopted a strongly typed schema and some other relational database features. In recent years, it added SQL support*. Today we are improving both the SQL dialect and the relational database features simultaneously. Sometimes there…


Google’s Spanner is a relational database with 99.999% availability, which translates to about 5 minutes of downtime a year. Spanner is a distributed system and can span multiple machines, multiple datacenters (and even geographical regions when configured). It splits the records automatically among its replicas and provides automatic failover. Unlike traditional failover models, Spanner doesn’t fail over to a secondary cluster but can elect an available read-write replica as the new leader.

In relational databases, providing both high availability and high consistency in writes is a very hard problem. …


A large majority of computer systems have some state and are likely to depend on a storage system. My knowledge of databases accumulated over time, but along the way our design mistakes caused data loss and outages. In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though it is impossible to ignore how databases work, the problems that application developers foresee and experience will often be just the tip of the iceberg. In this series, I’m sharing a few insights I found especially useful for developers who are not specialized in this domain.

  1. You…


If you are a cloud user, you have probably seen how unconventional storage options can get. This is true even for the disks you access from your virtual machines. There are not many ongoing conversations or references about the underlying details of core infrastructure. One such missing conversation is how persistent disks and replication fundamentally work.

Disks as services

Persistent disks are NOT local disks attached to the physical machines. Persistent disks are networking services and are attached to your VMs as network block devices. When you read from or write to a persistent disk, data is transmitted over the network.

Persistent disks use Colossus as their storage backend.

Persistent disks heavily rely…


Collecting various observability signals from systems fosters a new conversation around the classification of the signals.

There is significant discussion on observability signals and even strong advocacy for one signal over another. Metrics, events, logs, traces or others? In order to provide some structure to the conversation, it might be productive to provide a high-level breakdown of the signals based on how we utilize them.

There are three main high-level aspects observability enables:

  1. Health
  2. Availability
  3. Debuggability

Even though I introduce this classification, it is possible for a single signal to fit all three. It…


Google Cloud recently launched a fully managed container execution environment called Cloud Run. It is an environment specifically for request-driven workloads. It provides autoscaling, scaling down to zero, pretty fast deployments, automatic HTTPS support, global names and more. Google Cloud Run doesn’t have language runtime restrictions as long as the language runtime is supported on gVisor. It only requires the deployments to expose an HTTP server on a port.

In order to deploy a Go server, we will start with a minimal helloworld HTTP server. The container runtime expects the server to listen on $PORT:

$ cat main.go …

Jaana Dogan

See rakyll.org for more.
