My Saturday morning coffee is disrupted by this tweet:

“xooglers always be like ‘at google we…’” — devonbl

A month after leaving Google, I can relate to this. It probably annoys me more than it annoys my current coworkers at AWS, who know it’s ok for a new employee to make comparisons. I avoid all possible conflicts of interest. Then I’m in meetings with open source contributors who are ex-Googlers, so many sentences start with “At Google, we did…”, and I add a few more personal stories.

Let’s be fair: I always portrayed the good, the bad and the worst from past experiences. It’s critical to learn and not repeat the same obvious mistakes. And if there is anything worthwhile you can take from your past, it’s ok to bring that learning along. I don’t always enjoy naming companies when referring to experiences, but that’s how you provide context and stay honest with your audience. …

This article was my response to Amazon’s writing assessment when I was interviewed. I answered the question: “What is the most inventive or innovative thing you’ve done? It doesn’t have to be something that’s patented. It could be a process change, product idea, a new metric or customer-facing interface — something that was your idea…” [redacted]

At Google, our systems are large and complex. Our products rely on many microservices, caches, databases, storage systems and networking infrastructure. In the critical path, a user request may touch 100+ components. These components have different deployment and maintenance cycles, and most of our services are built and maintained by different teams. Teams rarely have full insight into each other’s services. Measuring latency, identifying latency issues and debugging them effectively are critical capabilities for the reliability of our systems. Producing actionable telemetry data and improving its navigability have been critical tasks for my team. …

Spanner is a distributed database Google built a while ago to provide a highly available and highly consistent database for its own workloads. …

Google’s Spanner is a relational database with 99.999% availability, which translates to roughly 5 minutes of downtime a year. Spanner is a distributed system and can span multiple machines, multiple datacenters (and even geographical regions when configured that way). It splits records automatically among its replicas and provides automatic failover. Unlike traditional failover models, Spanner doesn’t fail over to a secondary cluster but elects an available read-write replica as the new leader.

In relational databases, providing both high availability and high consistency in writes is a very hard problem. …

A large majority of computer systems have some state and are likely to depend on a storage system. My knowledge on databases accumulated over time, but along the way our design mistakes caused data loss and outages. In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though it is impossible to ignore how databases work, the problems that application developers foresee and experience will often be just the tip of the iceberg. In this series, I’m sharing a few insights I specifically found useful for developers who are not specialized in this domain.

  1. You are lucky if 99.999% of the time the network is not a problem. …

If you are a cloud user, you have probably seen how unconventional storage options can get. This is true even for the disks you access from your virtual machines. There are not many ongoing conversations or references about the underlying details of core infrastructure. One such missing conversation is how persistent disks and their replication fundamentally work.

Persistent disks are NOT local disks attached to the physical machines. Persistent disks are network services and are attached to your VMs as network block devices. When you read from or write to a persistent disk, data is transmitted over the network.

Persistent disks use Colossus as their storage backend.

Persistent disks rely heavily on Google’s distributed file system, Colossus, which serves most of the storage needs at Google. Persistent disk drivers automatically encrypt your data on the VM before it leaves the VM and is transmitted over the network. Then Colossus persists the data. Upon a read, the driver decrypts the incoming data. …

Collecting various observability signals from our systems fosters a new conversation around the classification of those signals.

There is significant discussion about observability signals, and even strong advocacy for one signal over the others. Metrics, events, logs, traces or something else? To give the conversation some structure, it might be productive to break the signals down at a high level based on how we utilize them.

There are three main high-level aspects observability enables:

  1. Health
  2. Availability
  3. Debuggability

Even though I introduce this classification, a single signal can fit all three. That’s not the ideal option, but consider how people rely on watching their logs to see that a service is up: they utilize logs to measure latency for certain operations, and dig into logs to see what’s going on when there is an issue. Even though logs can fit into all three categories, it takes distinct design and planning to make them useful in each. And although it is possible to utilize the same signal for all three aspects, we may rely on different signals for each because each signal type has its own pros and cons. …

Google Cloud recently launched Cloud Run, a fully managed container execution environment specifically for request-driven workloads. It provides autoscaling, scaling down to zero, pretty fast deployments, automatic HTTPS support, global names and more. Cloud Run doesn’t restrict language runtimes as long as the runtime is supported on gVisor. It only requires deployments to expose an HTTP server on a port.

In order to deploy a Go server, we will start with a minimal helloworld HTTP server. The container runtime expects the server to listen on $PORT:

$ cat main.go …

In the last few years, I found myself giving advice to a ton of companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, most of it is useful and applicable outside of Google Cloud.

  • Have reproducible builds: your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
  • Define and set SLOs for your service at design time. …

Go 1.7 introduced a built-in context type a while ago. In systems, a context can be used to pass request-scoped metadata, such as a request ID, among different functions, threads and even processes.

Go originally introduced the context package to the standard library to unify the context propagation inside the same process. …

Jaana Dogan
