Why is metric collection still a hard problem in 2020?

5 min read · Dec 6, 2020

Metric collection remains one of the hard problems. We’ve been collecting metrics for a very long time, so why is it still hard in 2020? Our workloads are becoming larger and more sophisticated, and to produce useful metric data we are producing and collecting richer data. We are also more interested in understanding what else was in context when a metric was collected. Was there a distributed trace or a log we can correlate? It’s important for us to navigate effectively among different telemetry signals during outages.

We also have a larger number of metadata items in our modern platforms. Imagine a container running on Kubernetes: you can associate the container with its cluster, node, pod, service, deployment and more. Additional metadata helps us identify the origin of problems, hence we collect metrics with a larger number of labels nowadays.

I find myself explaining daily why we are still investing a lot of time and energy into metric collection. In this article, I’ll give an overview of some of the most challenging topics that still make metric collection a hard problem.

High cardinality labels

In the last decade, high cardinality labels became a hot topic in metric collection. High cardinality labels allow you to break down metrics, and to filter and/or aggregate them by label. Without this capability, you have to know a lot more about your system and how it behaves in order to monitor it. This capability not only allows you to break down your data along many different dimensions, it also allows you to create new dimensions dynamically based on existing ones.

For example, if you are interested in retrieving metrics sourced from only one service, you can filter your time series and compare them with metrics sourced from other services on your monitoring dashboard. Without high cardinality, you would have to collect metrics separately in order to achieve the same behavior. In an ever more complicated stack, where you don’t have access to the source code or cannot redeploy your code each time you need new filters and aggregations, labelling can be critical.
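As a rough illustration of filtering and aggregating by label, here is a minimal in-memory sketch. The `LabelledCounter` class and the label names are made up for this example and don't correspond to any real library; a real system would store samples over time rather than a single running total.

```python
from collections import defaultdict


class LabelledCounter:
    """A toy counter that keeps one series per unique label set (hypothetical)."""

    def __init__(self):
        # Key: sorted tuple of (label, value) pairs. Value: running total.
        self._series = defaultdict(float)

    def add(self, amount, **labels):
        self._series[tuple(sorted(labels.items()))] += amount

    def total(self, **match):
        """Sum every series whose labels include all of the `match` pairs."""
        return sum(
            value
            for key, value in self._series.items()
            if match.items() <= dict(key).items()
        )

    def group_by(self, label):
        """Aggregate all series into one total per value of `label`."""
        totals = defaultdict(float)
        for key, value in self._series.items():
            totals[dict(key).get(label, "")] += value
        return dict(totals)


requests = LabelledCounter()
requests.add(3, service="checkout", pod="checkout-1")
requests.add(2, service="checkout", pod="checkout-2")
requests.add(7, service="search", pod="search-1")

print(requests.total(service="checkout"))  # 5.0
print(requests.group_by("service"))        # {'checkout': 5.0, 'search': 7.0}
```

The same recorded data answers both questions; no separate metric per service had to be defined up front.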

Unfortunately, there are limits to how much cardinality your collection system can handle. Previously, most metric collection systems and time series storage systems were not built to store and query high cardinality data at all. Historically, most systems were not instrumented to produce high cardinality labels either. Even though high cardinality labels are becoming a bigger necessity, the industry is still struggling with best practices on labelling, we don’t always have the capability to generate metrics with labels, and we don’t always have a metric collection backend that can store and query high cardinality metrics. Prometheus and Monarch are two examples of metric collection systems that support labelled metrics.
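To get a feel for why backends hit limits, it helps to do the arithmetic. In the worst case, a single metric produces one time series per combination of label values. The sketch below uses made-up label names and counts, and assumes the labels vary independently, which gives an upper bound rather than an exact figure.

```python
from math import prod


def max_series(distinct_values_per_label):
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(distinct_values_per_label.values())


# Hypothetical label set for a single request-count metric.
labels = {"service": 50, "endpoint": 20, "status_code": 10, "region": 5}
print(max_series(labels))  # 50000

# Adding one unbounded-looking label multiplies everything.
labels["user_id"] = 100_000
print(max_series(labels))  # 5000000000
```

One extra label with many values can turn a manageable metric into billions of potential series, which is where storage and query costs start to dominate.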

Cross-stack labels

Even though metric labelling is seeing more adoption, being able to propagate labels on the wire is still an unsolved puzzle. There are no well-established propagation standards on the wire or in language runtimes.

Cross-stack labelling is a key technique that differentiates metric collection in microservices environments from the instrumentation of monolithic systems. As microservices systems grew larger, some practices from monolithic systems didn’t scale.

In microservices architectures, it is likely that some services will be a common dependency for a lot of teams. Examples are authentication and databases, which everyone needs and ends up depending on. Teams that depend on such a service want to see its metrics broken down by their own usage, but the shared service itself knows little about who is calling it. The solution to this problem is to produce the labels at the upper-level services that call into the lower-level services. After producing these labels, we propagate them on the wire as part of our requests. The lower-level services can use the incoming labels when recording metrics. This is how we get fine-grained dimensions at the lower ends of the stack, regardless of how many layers of services there are above.
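The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `x-metric-labels` header name and its `key=value,key=value` encoding are invented for this example and are not an established standard, and plain dictionaries stand in for real HTTP requests.

```python
from collections import Counter
from urllib.parse import quote, unquote

LABEL_HEADER = "x-metric-labels"  # hypothetical header name, not a standard


def inject(labels, headers):
    """Upper-level service: serialize labels into an outgoing request header."""
    headers[LABEL_HEADER] = ",".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(labels.items())
    )
    return headers


def extract(headers):
    """Lower-level service: parse labels from an incoming request header."""
    raw = headers.get(LABEL_HEADER, "")
    pairs = (item.split("=", 1) for item in raw.split(",") if "=" in item)
    return {unquote(k): unquote(v) for k, v in pairs}


# A shared, lower-level service records its own metric with the caller's labels.
auth_requests = Counter()


def handle_auth_request(headers):
    labels = extract(headers)
    key = (("service", "auth"),) + tuple(sorted(labels.items()))
    auth_requests[key] += 1


outgoing = inject({"caller": "checkout", "team": "payments"}, {})
handle_auth_request(outgoing)
handle_auth_request(inject({"caller": "search", "team": "discovery"}, {}))
handle_auth_request(outgoing)

for key, count in sorted(auth_requests.items()):
    print(dict(key), count)
```

The shared service never hard-codes who its callers are, yet its request count can be broken down by `caller` and `team`. In a deeper stack, each intermediate service would also need to forward the header, which is exactly the part that needs support from the entire stack.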

Without strong propagation standards and first class support from the entire stack, this is a hard problem.

Correlation and exemplars

In a typical telemetry data collection pipeline, there is often more than just metrics. We instrument our services with a variety of different tools. One example everyone can relate to is logs. Others could be events, distributed traces, runtime profiles coming from production and any other telemetry data you can name.

Given the wide variety and the large amount of data, it’s not easy to navigate the collected telemetry, especially during incidents. Being able to see correlated telemetry data from metric dashboards is a great capability.

Historically, correlation between metrics and other telemetry data didn’t exist in common tools. Our instrumentation libraries didn’t have the capability to collect exemplars while recording new metric values. OpenCensus is an example of an instrumentation library that collects trace exemplars with metrics. Honeycomb and Grafana are examples of tools that allow you to navigate to traces from metric dashboards.
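To make the idea of an exemplar concrete, here is a minimal sketch of a latency histogram that remembers one sample trace ID per bucket. The class, the bucket bounds and the trace IDs are all hypothetical; this is not how OpenCensus or any specific library implements it.

```python
import bisect


class HistogramWithExemplars:
    """Toy histogram that keeps the most recent traced sample per bucket."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end for values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.exemplars = [None] * (len(self.upper_bounds) + 1)

    def record(self, value, trace_id=None):
        # Index of the first bucket whose upper bound is >= value.
        i = bisect.bisect_left(self.upper_bounds, value)
        self.counts[i] += 1
        if trace_id is not None:
            # Keep the latest traced sample as this bucket's exemplar.
            self.exemplars[i] = (value, trace_id)


latency_ms = HistogramWithExemplars([100, 500, 1000])
latency_ms.record(42, trace_id="trace-a")
latency_ms.record(730, trace_id="trace-b")
latency_ms.record(810, trace_id="trace-c")
latency_ms.record(95)  # not traced, so no exemplar update

print(latency_ms.counts)        # [2, 0, 2, 0]
print(latency_ms.exemplars[2])  # (810, 'trace-c')
```

A dashboard that shows a spike in the slow bucket can then link straight to `trace-c` instead of leaving you to search for a matching trace by hand.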

Export formats

Collecting metrics has always been a difficult topic because there are so many ways in which services and platforms export metric data. Each year, a few new initiatives try to solve this outstanding problem.

As a result, we have many collection paths, and are asking our users to deploy and maintain more than one collection pipeline.
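As a small illustration of how the same data point ends up looking different on the wire, the sketch below renders one sample in two commonly seen text shapes: a line in the style of the Prometheus text exposition format, and a line in the style of StatsD with DogStatsD-style tags. Both functions are simplified assumptions for this example; they skip escaping, type metadata and timestamps, and the metric and label names are made up.

```python
def to_prometheus_style(name, labels, value):
    """Render `name{label="value",...} value` (simplified, no escaping)."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def to_statsd_style(name, labels, value):
    """Render `name:value|c|#tag:value,...` (simplified, no escaping)."""
    tags = ",".join(f"{k}:{v}" for k, v in sorted(labels.items()))
    return f"{name}:{value}|c|#{tags}"


labels = {"service": "checkout", "method": "GET"}
print(to_prometheus_style("http_requests_total", labels, 5))
# http_requests_total{method="GET",service="checkout"} 5
print(to_statsd_style("http_requests_total", labels, 5))
# http_requests_total:5|c|#method:GET,service:checkout
```

The syntax is the easy part. The two styles also tend to differ in meaning, for example whether a counter value is a cumulative total or an increment, which is part of why bridging formats in one pipeline takes real work.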

Pull vs push

In a pull model, your metric collection system pulls metrics from your services, whereas in a push model your services push metrics to a metric collection service. This is a routine topic of debate, even though there is no single answer to whether pulling or pushing metrics is the better approach.

Even though both have their pros and cons, having to support both pull and push in the same collection pipeline is not always an easy task. Prometheus is an example of a pull system, whereas Monarch supports pushes and instructs clients how often they should push. We often have to care about both models because our users often need a combination of the two.
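The difference between the two models mostly comes down to who initiates the transfer and who needs to know about whom. Here is a minimal in-process sketch with no networking; the `Service` and `Collector` classes are invented for illustration only.

```python
class Service:
    def __init__(self, name):
        self.name = name
        self.requests = 0

    def snapshot(self):
        """Current metric values, as a real service might expose on an endpoint."""
        return {"service": self.name, "requests": self.requests}


class Collector:
    def __init__(self):
        self.targets = []   # only needed for the pull model
        self.samples = []

    def scrape_all(self):
        """Pull: the collector knows its targets and decides when to read them."""
        for target in self.targets:
            self.samples.append(("pull", target.snapshot()))

    def receive(self, sample):
        """Push: the service decides when to send; the collector only receives."""
        self.samples.append(("push", sample))


collector = Collector()
checkout = Service("checkout")
batch_job = Service("nightly-batch")

# Pull: the long-running service is registered as a target and scraped on a schedule.
collector.targets.append(checkout)
checkout.requests += 3
collector.scrape_all()

# Push: a short-lived job sends its own metrics before it exits.
batch_job.requests += 1
collector.receive(batch_job.snapshot())

for mode, sample in collector.samples:
    print(mode, sample)
```

In the pull case the collector has to discover and reach every target. In the push case every service has to know where the collector is and how often to send. A pipeline that supports both has to carry both sets of concerns.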

Aggregation

Metric collection pipelines often aggregate data because reporting each individual measurement wouldn’t scale in large systems. Aggregation often happens in multiple layers. Metrics collected from the services can be continuously aggregated along the way until they reach long-term storage. Building and operating aggregation pipelines is not always trivial. In addition to pipelines, there are instrumentation libraries that start aggregating inside the application processes. Figuring out the right aggregation window and fine-tuning aggregations for performance can be a hard problem.
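A minimal sketch of layered aggregation, assuming each point is a `(timestamp_seconds, count, total)` tuple and that windows are fixed and aligned to multiples of the window size. The function name and the sample data are made up for this example.

```python
from collections import defaultdict


def aggregate(points, window):
    """Roll up (timestamp, count, total) points into fixed windows of
    `window` seconds, keyed by the start of each window."""
    buckets = defaultdict(lambda: [0, 0.0])
    for ts, count, total in points:
        start = ts - ts % window
        buckets[start][0] += count
        buckets[start][1] += total
    return [(start, c, t) for start, (c, t) in sorted(buckets.items())]


# Raw latency measurements: one request each, latency in milliseconds.
raw = [(0, 1, 120.0), (4, 1, 80.0), (12, 1, 200.0), (61, 1, 100.0)]

per_10s = aggregate(raw, 10)      # first layer, e.g. inside the process
per_60s = aggregate(per_10s, 60)  # second layer, e.g. before long-term storage

print(per_10s)  # [(0, 2, 200.0), (10, 1, 200.0), (60, 1, 100.0)]
print(per_60s)  # [(0, 3, 400.0), (60, 1, 100.0)]
```

Carrying a count and a sum, rather than an average, is what lets the layers compose: aggregating the 10-second windows into 60-second windows gives the same result as aggregating the raw points directly. The choice of window still trades resolution against volume, since anything that happened inside a window is no longer distinguishable after the roll-up.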