How to peacefully grow your service

Over almost two decades, I have seen numerous Internet services, ranging from small services for niche markets to Tier 1 services at major Internet companies. In that time, beyond database or application design, the technical success of a growing product benefited significantly from adopting a few characteristics. In this article, I'll capture a few of them to highlight what helped us.

Feature boundaries

Data autonomy

Data autonomy means the consumers of your data shouldn’t have direct access to your database, shouldn’t be able to run JOINs between their tables and yours, shouldn’t be able to introduce new data operations without your approval, and shouldn’t need to know the details of your storage layer to be a consumer. It’s almost impossible to take these permissions away once that ship has sailed.
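To make the idea concrete, here is a minimal sketch of a service that owns its data and exposes only approved operations. The `OrderService` and `Order` names, and the in-memory dictionary standing in for the storage layer, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Order:
    order_id: str
    customer_id: str
    total_cents: int


class OrderService:
    """Hypothetical owner of order data; consumers only see this API."""

    def __init__(self) -> None:
        # Storage detail hidden from consumers. It could be replaced by a
        # SQL or key-value store without changing the public methods.
        self._rows: Dict[str, dict] = {}

    def create_order(self, order_id: str, customer_id: str, total_cents: int) -> Order:
        self._rows[order_id] = {"customer": customer_id, "total": total_cents}
        return Order(order_id, customer_id, total_cents)

    def orders_for_customer(self, customer_id: str) -> List[Order]:
        # An explicitly approved read operation instead of an ad hoc JOIN.
        return [
            Order(oid, row["customer"], row["total"])
            for oid, row in sorted(self._rows.items())
            if row["customer"] == customer_id
        ]
```

Consumers receive `Order` values rather than rows, so the owning team stays free to change the schema or the storage engine behind the methods.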

Cell-based architecture

Consider a multi-regional cell-based architecture where two services and their data partition form a cell. A user request is routed to the region and cell where the user's data lives.
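As a rough sketch of that routing step, the snippet below keeps a lookup from user to region and cell. The `CellRouter` class, the region and cell names, and the hostname scheme are assumptions made up for this example; a real system would back the lookup with a replicated placement store.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Placement:
    region: str
    cell: str


class CellRouter:
    """Maps each user to the region and cell that hold their data."""

    def __init__(self) -> None:
        self._placements: Dict[str, Placement] = {}

    def assign(self, user_id: str, region: str, cell: str) -> None:
        # Calling assign again moves the user, e.g. after a tenant migration.
        self._placements[user_id] = Placement(region, cell)

    def route(self, user_id: str) -> Placement:
        try:
            return self._placements[user_id]
        except KeyError:
            raise LookupError(f"no placement for user {user_id!r}") from None

    def endpoint(self, user_id: str) -> str:
        p = self.route(user_id)
        # Hypothetical internal hostname scheme.
        return f"https://{p.cell}.{p.region}.example.internal"
```

A front door would call `endpoint(user_id)` on every request and forward the request to the returned address.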

Cell-based architecture makes you invest in automation, which makes it easier to build new cells and regions as you grow. Cell-based architectures can also have tooling in place to migrate tenants between cells when capacity becomes a concern.

Cell-based architectures can be used to provide redundancy and automatic failover. A cell can be a replica of another cell, and the load balancer can fall back to the new primary if a cell goes away. Replicas are often run in different availability zones to reduce the impact of a zonal outage.
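The fallback decision can be sketched in a few lines. The `pick_cell` function and the health-check callback are hypothetical; real load balancers make this decision with their own health probes and configuration.

```python
from typing import Callable, Sequence


def pick_cell(candidates: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first healthy cell, in priority order (primary first)."""
    for cell in candidates:
        if is_healthy(cell):
            return cell
    raise RuntimeError("no healthy cell available")


# Example: the primary in zone A is down, so traffic goes to the replica.
down = {"cell-1-zone-a"}
target = pick_cell(["cell-1-zone-a", "cell-1-zone-b"], lambda c: c not in down)
```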

Cell-based architectures limit the blast radius of an outage or a security event. Combined with automated continuous delivery practices, they let you quickly roll back changes without impacting all of your customers. See Automating safe, hands-off deployments for an example of how Amazon uses cells to reduce the impact of bad pushes.
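Here is a simplified sketch of a cell-by-cell rollout that stops at the first unhealthy cell. It is not how any particular deployment system works; the `rollout` function, the `versions` map, and the `deploy_ok` health check are all hypothetical, and the sketch only rolls back the failing cell.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def rollout(
    cells: Sequence[str],
    new_version: str,
    versions: Dict[str, str],
    deploy_ok: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy one cell at a time; roll back the failing cell and stop."""
    updated: List[str] = []
    for cell in cells:
        previous = versions[cell]
        versions[cell] = new_version
        if not deploy_ok(cell):
            versions[cell] = previous  # roll back only the failing cell
            return updated, cell       # remaining cells are never touched
        updated.append(cell)
    return updated, None
```

Because the rollout halts at the first failure, customers in the cells that were not yet reached never see the bad change.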

Service and database dependencies

If you need to rely on a dependency with a lower SLO while maintaining a higher SLO yourself, you can move your calls out of the critical path (e.g. by running them in background jobs), fall back to a default or cheaper behavior, or gracefully degrade the experience.
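The fallback option might look like the sketch below. The `recommendations` function, the `fetch` callable standing in for the less reliable dependency, and the default list are assumptions for illustration; a production version would also enforce a timeout on the call.

```python
from typing import Callable, Iterable, List, Sequence


def recommendations(
    user_id: str,
    fetch: Callable[[str], Iterable[str]],
    default: Sequence[str] = ("bestsellers",),
) -> List[str]:
    """Return personalized results, or a cheap default if the dependency fails."""
    try:
        return list(fetch(user_id))
    except Exception:
        # The dependency has a lower SLO than we do, so its failure must
        # not fail our request. Serve the default instead.
        return list(default)
```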

Graceful degradation

Unique identifiers

Idempotency

SLOs and quotas

Quotas or rate limits allow you to reject aggressive call patterns from upstream services. Introducing quotas later can be a breaking change unless they are permissive, so it pays to define and enforce them early.
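One common way to enforce such a limit is a token bucket. The sketch below is a minimal single-process version; the `TokenBucket` class and its parameters are illustrative, and a real service would typically keep one bucket per caller and share state across instances.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate_per_sec` calls on average, with bursts up to `burst`."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller should reject, e.g. with HTTP 429
```

The injectable `clock` is only there to make the behavior easy to test without sleeping.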

Observability

Dimensions are arbitrary labels collected with telemetry. They allow engineers to narrow down telemetry to identify the blast radius during an outage. For example, being able to break down a service’s latency metrics by Kubernetes cluster name lets you see which clusters are affected. Dimensions can also be used to find other relevant telemetry; you might want to view the logs for the affected clusters to troubleshoot. Dimensions, when consistently and automatically applied, will improve your incident response and troubleshooting capabilities. They can get out of hand, however, if their cardinality becomes too high for metrics, so it’s critical to build the fundamentals that let you adjust dimensions.
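As a small illustration, the sketch below groups latency samples by a dimension and counts a dimension's distinct values. The sample shape (a dict with `dims` and `latency_ms`) and the function names are assumptions for this example, not the API of any telemetry library.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def latency_by(samples: List[dict], dimension: str) -> Dict[str, float]:
    """Average latency per value of the given dimension."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        key = sample["dims"].get(dimension, "unknown")
        groups[key].append(sample["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


def cardinality(samples: List[dict], dimension: str) -> int:
    """Number of distinct values seen for a dimension."""
    return len({s["dims"][dimension] for s in samples if dimension in s["dims"]})


samples = [
    {"dims": {"cluster": "prod-a", "user_id": "u1"}, "latency_ms": 10.0},
    {"dims": {"cluster": "prod-a", "user_id": "u2"}, "latency_ms": 30.0},
    {"dims": {"cluster": "prod-b", "user_id": "u3"}, "latency_ms": 500.0},
]
by_cluster = latency_by(samples, "cluster")
```

Here `by_cluster` points at `prod-b` as the slow cluster, while `cardinality(samples, "user_id")` grows with every user, which is the kind of dimension that tends to be too expensive for metrics.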

Propagation standards, such as a trace context or how dimensions are propagated, are another area where retrofits are costly and sometimes impossible. The earlier you align on request headers and runtime context propagation, the less of a problem it will be.

The success of Internet services relies on a wide range of criteria, such as good database design, capacity planning, CI/CD capabilities, understanding the limitations of your service dependencies, and most importantly technical and product talent. In this article, I tried to capture some characteristics that are often underestimated but pay off easily. Some early-stage decisions can save years of investment in the long run, and you can spend those years building great products for your customers instead of fighting fires.

See rakyll.org for more.
