My Saturday morning coffee is disrupted by this tweet:

“xooglers always be like ‘at google we…’” — devonbl

A month after leaving Google, I can relate to this. It probably annoys me more than it annoys my current coworkers at AWS, who know it’s ok for a new employee to make comparisons. I avoid all possible conflicts of interest. Then I’m in meetings with open source contributors who are ex-Googlers, so many sentences start with “At Google, we did…”, and I add a few more personal stories.

Let’s be fair: I always portrayed the good, the bad and the worst from past experiences. It’s critical to learn and not repeat the same obvious mistakes. And if there is anything worthwhile you can take from your past, it’s ok to bring that learning along. I don’t always enjoy naming companies when referring to experiences, but that’s how you provide context and stay honest with your audience. …

This article was my response to Amazon’s writing assessment when I was interviewed. I answered the question: “What is the most inventive or innovative thing you’ve done? It doesn’t have to be something that’s patented. It could be a process change, product idea, a new metric or customer-facing interface — something that was your idea…” [redacted]

At Google, our systems are large and complex. Our products rely on many microservices, caches, databases, storage systems and networking infrastructure. In the critical path, a user request may touch 100+ components. These components have different deployment and maintenance cycles, and most of our services are built and maintained by different teams. Teams rarely have full insight into each other’s services. Measuring latency, identifying latency issues and debugging them effectively are critical capabilities for the reliability of our systems. Producing actionable telemetry data and improving its navigability have been critical tasks for my team. …

Spanner is a distributed database Google built a while ago to provide a highly available and highly consistent database for its own workloads. …

Google’s Spanner is a relational database with 99.999% availability, which translates to roughly 5 minutes of downtime a year. Spanner is a distributed system and can span multiple machines, multiple datacenters (and even geographical regions when configured that way). It splits records automatically among its replicas and provides automatic failover. Unlike traditional failover models, Spanner doesn’t fail over to a secondary cluster but elects an available read-write replica as the new leader.

In relational databases, providing both high availability and high consistency in writes is a very hard problem. …

A large majority of computer systems have some state and are likely to depend on a storage system. My knowledge on databases accumulated over time, but along the way our design mistakes caused data loss and outages. In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though it is impossible to ignore how databases work, the problems that application developers foresee and experience will often be just the tip of the iceberg. In this series, I’m sharing a few insights I specifically found useful for developers who are not specialized in this domain.

  1. You are lucky if 99.999% of the time the network is not a problem. …

If you are a cloud user, you have probably seen how unconventional storage options can get. This is true even for the disks you access from your virtual machines. There are not many ongoing conversations or references about the underlying details of core infrastructure. One such missing conversation is how persistent disks and their replication fundamentally work.

Persistent disks are NOT local disks attached to the physical machines. Persistent disks are network services and are attached to your VMs as network block devices. When you read from or write to a persistent disk, data is transmitted over the network.

Persistent disks use Colossus as their storage backend.

Persistent disks rely heavily on Google’s distributed file system, Colossus, which serves most of the storage needs at Google. Persistent disk drivers automatically encrypt your data on the VM before it leaves the VM and is transmitted over the network. Then Colossus persists the data. Upon a read, the driver decrypts the incoming data. …

Collecting various observability signals from our systems fosters a new conversation around the classification of those signals.

There is significant discussion about observability signals, and even strong advocacy for one signal over the others. Metrics, events, logs, traces or something else? To give the conversation some structure, it might be productive to break the signals down at a high level based on how we utilize them.

There are three main high-level aspects observability enables:

  1. Health
  2. Availability
  3. Debuggability

Even though I introduce this classification, a single signal can fit all three. That’s not the ideal option, but consider how people rely on watching their logs to see that a service is up: they utilize logs to measure latency for certain operations, and dig into logs to see what’s going on when there is an issue. Even though logs can fit into all three categories, it takes distinct design and planning to make them useful in each. And although it is possible to utilize the same signal for all three aspects, we may rely on different signals for each because each signal type has its own pros and cons. …

Google Cloud recently launched Cloud Run, a fully managed container execution environment specifically for request-driven workloads. It provides autoscaling, scaling down to zero, pretty fast deployments, automatic HTTPS support, global names and more. Cloud Run doesn’t restrict language runtimes as long as the runtime is supported on gVisor. It only requires deployments to expose an HTTP server on a port.

In order to deploy a Go server, we will start with a minimal helloworld HTTP server. The container runtime expects the server to listen on $PORT:

$ cat main.go …

In the last few years, I found myself giving advice to a ton of companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, most of it is useful and applicable outside of Google Cloud.

  • Have reproducible builds: your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
  • Define and set SLOs for your service at design time. …

Go 1.7 introduced a built-in context type a while ago. In systems, a context can be used to pass request-scoped metadata, such as a request ID, among different functions, threads and even processes.

Go originally introduced the context package to the standard library to unify the context propagation inside the same process. …

Jaana Dogan
