Metric collection remains one of the hard problems. We have been collecting metrics for a very long time, so why is it still a hard problem in 2020? Our workloads are becoming larger and more sophisticated, and to keep the data useful we are producing and collecting richer metrics. We are also more interested in understanding what else was in the context when a metric was collected. Was there a distributed trace or a log we can correlate it with? Being able to navigate effectively among different telemetry signals matters during outages. Modern platforms also carry a much larger number of metadata items. Imagine a container running on Kubernetes: you can associate the container with its cluster, node, pod, service, deployment and more. …
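To make that concrete, here is a small sketch of what a "rich" metric data point can carry; the struct and the label keys are illustrative, not the schema of any particular metrics library:

```go
package main

import (
	"fmt"
	"time"
)

// Point is a single latency measurement together with the context it was
// collected in: resource metadata and an optional link to a distributed trace.
// The field names and label keys below are illustrative only.
type Point struct {
	Name     string            // e.g. "http.server.duration"
	Value    time.Duration     // the measured latency
	Time     time.Time         // when it was recorded
	Resource map[string]string // metadata describing where it was produced
	TraceID  string            // correlation handle to a distributed trace, if any
}

func main() {
	p := Point{
		Name:  "http.server.duration",
		Value: 87 * time.Millisecond,
		Time:  time.Now(),
		Resource: map[string]string{
			"k8s.cluster":    "prod-us-east1",
			"k8s.node":       "node-42",
			"k8s.pod":        "checkout-7d9f",
			"k8s.deployment": "checkout",
			"k8s.service":    "checkout",
		},
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
	}
	fmt.Printf("%+v\n", p)
}
```

Each additional label is useful for correlation and navigation, but it also makes the data heavier to produce, transmit and store.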
My Saturday morning coffee is disrupted by this tweet:
“xooglers always be like ‘at google we…’” — devonbl
A month into my departure from Google, I can relate to this. It probably annoys me more than it annoys my current coworkers at AWS, because they know it’s ok to make comparisons when you are a new employee. I avoid all possible conflicts of interest. Then I’m in meetings with open source contributors who are ex-Googlers, where so many sentences start with “At Google, we did…”, and I add a few more personal stories of my own.
Let’s be fair. I always portrayed the good, the bad and the worst from past experiences. It’s critical to learn and not repeat the same obvious mistakes. And if there is anything genuinely good you can take from your past, it’s ok to bring that opportunity and learning with you. I don’t always enjoy naming companies when referring to experiences, but that’s how you provide context and stay honest with your audience. …
This article was my response to Amazon’s writing assessment when I was interviewed. I answered the question of “What is the most inventive or innovative thing you’ve done? It doesn’t have to be something that’s patented. It could be a process change, product idea, a new metric or customer facing interface — something that was your idea… [redacted]
At Google, our systems are complex and large. Our products rely on many microservices, caches, databases, storage systems and networking infrastructure. In the critical path, a user request may touch 100+ components. These components have different deployment and maintenance cycles, and most of our services are built and maintained by different teams. Teams rarely have full insight into each other’s services. Measuring latency, identifying latency issues and being able to debug them effectively is an important capability for the reliability of our systems. Producing actionable telemetry data and improving its navigability has been a critical task for my team. …
Spanner is a distributed database that Google started building a while ago to have a highly available and highly consistent database for its own workloads. Spanner was initially built to be a key/value store, had a completely different shape than it does today, and had different goals. Since the beginning, it had transactional capabilities, external consistency and was able to fail over transparently. Over time, Spanner adopted a strongly typed schema and other relational database features. In recent years, it added SQL support*. Today we are improving both the SQL dialect and the relational database features simultaneously. Sometimes there is confusion about whether Spanner supports SQL or not. The short answer is yes. …
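For a sense of what that looks like in practice, here is a rough sketch of running a SQL query against Cloud Spanner from the Go client; the project, instance, database, table and column names are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// The database path below is a placeholder.
	client, err := spanner.NewClient(ctx,
		"projects/my-project/instances/my-instance/databases/my-db")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A plain SQL query with a bound parameter.
	stmt := spanner.Statement{
		SQL:    "SELECT UserID, Name FROM Users WHERE Active = @active",
		Params: map[string]interface{}{"active": true},
	}

	iter := client.Single().Query(ctx, stmt)
	defer iter.Stop()
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		var id int64
		var name string
		if err := row.Columns(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
}
```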
Google’s Spanner is a relational database with 99.999% availability, which translates to about five minutes of downtime a year. Spanner is a distributed system: it can span multiple machines, multiple datacenters (and even geographical regions when configured). It splits records automatically among its replicas and provides automatic failover. Unlike traditional failover models, Spanner doesn’t fail over to a secondary cluster but can elect an available read-write replica as the new leader.
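As a quick back-of-the-envelope check on that number: a year is roughly 365 × 24 × 60 = 525,600 minutes, and 525,600 × (1 − 0.99999) ≈ 5.3 minutes of allowed downtime per year.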
In relational databases, providing both high availability and high consistency in writes is a very hard problem. …
A large majority of computer systems have some state and are likely to depend on a storage system. My knowledge of databases accumulated over time, but along the way our design mistakes caused data loss and outages. In data-heavy systems, databases are at the core of system design goals and tradeoffs. Even though it is impossible to ignore how databases work, the problems that application developers foresee and experience will often be just the tip of the iceberg. In this series, I’m sharing a few insights I specifically found useful for developers who are not specialized in this domain.
If you are a cloud user, you have probably seen how unconventional storage options can get. This is true even for the disks you access from your virtual machines. There are not many ongoing conversations or references about the underlying details of core infrastructure. One such missing conversation is how persistent disks and replication fundamentally work.
Persistent disks are NOT local disks attached to the physical machines. Persistent disks are network services and are attached to your VMs as network block devices. When you read from or write to a persistent disk, data is transmitted over the network.
Persistent disks heavily rely on Google’s file system, Colossus. Colossus is a distributed block storage system that serves most of the storage needs at Google. Persistent disk drivers automatically encrypt your data on the VM before it leaves your VM and is transmitted on the network. Then, Colossus persists the data. Upon a read, the driver decrypts the incoming data. …
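As a mental model of that path, here is a conceptual sketch only; the function names and the in-memory stand-in for Colossus are made up for illustration, not real driver or Colossus interfaces:

```go
package main

import "fmt"

// Stand-in for the distributed storage backend reached over the network.
var colossus = map[int64][]byte{}

// Placeholder transformations; real drivers use proper block encryption.
func encrypt(b []byte) []byte { return append([]byte("ciphertext:"), b...) }
func decrypt(b []byte) []byte { return b[len("ciphertext:"):] }

// On a write, the driver encrypts inside the VM; only ciphertext crosses the network.
func writeBlock(offset int64, block []byte) {
	colossus[offset] = encrypt(block)
}

// On a read, ciphertext comes back over the network and the driver decrypts it inside the VM.
func readBlock(offset int64) []byte {
	return decrypt(colossus[offset])
}

func main() {
	writeBlock(0, []byte("hello"))
	fmt.Println(string(readBlock(0)))
}
```

The point of the sketch is that plaintext never leaves the VM and the storage backend only ever sees encrypted blocks.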
As we collect various observability signals from our systems, a new conversation emerges around how to classify them.
There is significant discussion on observability signals and even strong advocacy for one signal over another. Metrics, events, logs, traces or others? To provide some structure to the conversation, it might be productive to give a high-level breakdown of the signals based on how we utilize them.
There are three main high-level aspects observability enables:
Even though I introduce this classification, it is possible for a single signal to fit all three. It is not the ideal option, but imagine how people rely on watching their logs to see that a service is up, utilize logs to measure latency for certain operations, and dig into logs to see what’s going on when there is an issue. Even though logs can fit into all three categories, it takes distinct design and planning to make them useful in each. And although it is possible to utilize the same signal for all three aspects, we may rely on different signals for each because each signal type has its pros and cons. …
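To illustrate the kind of deliberate design that takes, here is a sketch of a structured log entry whose fields are chosen so the same record can answer all three questions; the field names are illustrative, not a schema from any particular logging library:

```go
package main

import (
	"log"
	"time"
)

// RequestLog is one structured log entry designed to serve three uses at once:
// a crude up/error-rate signal (Status), per-operation latency measurement
// (Latency), and enough detail to debug a specific failure (TraceID, Error).
type RequestLog struct {
	Operation string
	Status    string        // "ok" or "error"
	Latency   time.Duration // how long the operation took
	TraceID   string        // correlation handle for debugging
	Error     string        // only populated when something went wrong
}

func main() {
	entry := RequestLog{
		Operation: "checkout",
		Status:    "error",
		Latency:   412 * time.Millisecond,
		TraceID:   "4bf92f3577b34da6a3ce929d0e0e4736",
		Error:     "payment backend timed out",
	}
	log.Printf("%+v", entry)
}
```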
Google Cloud recently launched a fully managed container execution environment called Cloud Run. It is an environment specifically for request-driven workloads. It provides autoscaling, scaling down to zero, fast deployments, automatic HTTPS support, global names and more. Cloud Run doesn’t restrict language runtimes as long as the runtime is supported on gVisor. It only requires the deployment to expose an HTTP server on a port.
To deploy a Go server, we will start with a minimal hello world HTTP server. The container runtime expects the server to listen on $PORT:
$ cat main.go …
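A minimal version of such a server, assuming a plain net/http handler and the PORT environment variable that Cloud Run provides, looks roughly like this:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// Cloud Run injects the port to listen on via the PORT environment variable.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080" // fallback for local runs
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Hello, World!")
	})

	log.Fatal(http.ListenAndServe(":"+port, nil))
}
```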
In the last few years, I found myself giving advice to a ton of companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, much of it is useful and applicable outside of Google Cloud as well.