Cindy Sridharan’s fantastic article on why everyone is not ops makes you rethink the relationship between your development and operations teams. In this article, I try to explain how Google approaches the problem and why it scales for our SRE team.
At Google, SRE stands for site reliability engineer. Site reliability is about the velocity and productivity of our engineers, the performance and reliability of our products, and the health of our code base and production environment. I don’t like to say SRE is Google’s way of doing ops because SRE is a significant rethink of how we do ops. SRE is a standalone organization and an independent silo at Google. They maintain large production systems, they are the go-to team for consultancy on anything production-related, they set the best practices, and they contribute to the infra and tools that make production easy for our software engineers.
What makes Google SRE significantly different is not just their world-class expertise but the fact that they are optional at Google. No, I am not missing a negation in the previous sentence. They are optional. When we begin working on a new product or project, the development team owns every aspect, from writing design docs to writing code and from unit tests to integration tests. We go through a long series of reviews, from security to privacy to production readiness. We are responsible for deploying our code, monitoring it, being on call, and putting out fires when required. We do it all ourselves, as if there were no SRE or we were our own SREs.
But how does it work if SRE is optional? Working at Google provides you with a whole suite of infrastructure that you take for granted: networking, storage systems, lock systems, auto scaling and scheduling, naming, configuration, and much more. The infrastructure components are staffed by software engineers and often supported by SRE. SRE, in turn, is not an organization that helps every team in person. Instead, they build reusable best practices and support critical technical infrastructure services, which makes the production experience better for everyone. SRE culture and best practices are very well established at Google. Do you want to deploy a production service that scales to the world? We have infrastructure that helps you with that. Do you want to have world-class dashboards? We have that. Do you need a plan and a better understanding of how you should monitor your code? SRE has solutions and best practices for that. Do you want to roll out a new critical service? SRE provides consultancy for that.
The main idea is that the SRE organization is not obligated to support any given product at Google. Every team gets the infra and SRE best practices for free, and earns part-time and later full-time SRE support by becoming a critical, large-scale product. A typical timeline for getting SRE support:
- Build a product, coordinate with the team that supports launches, and ask for SRE consultancy if required.
- Set an SLO, and try to recruit part-time SRE support once you hit critical scale.
- The SRE team will give you a list of requirements to meet before your product is suitable for their support. Once you meet their criteria, start adding SREs to your on-call rotation.
- Grow the SRE support by using your headcount as your scale grows. Keep your development team responding to prod issues part-time, so they still understand what’s going on in prod.
- Downscale the SRE support if your project is shrinking in scale, and finally let your development team own the SRE work if the scale doesn’t require SRE support.
This model lets the SRE organization focus on solutions that scale rather than investing a lot of time in specific products that are not impactful. The SRE headcount on a team comes from the development team’s headcount, so the development team would prefer to handle the SRE job themselves if they are not large enough to ask for additional help. For complex systems and large-scale infrastructure, SRE is there in person as a part of the team. And as they learn, they also contribute to the infrastructure, tools, and knowledge that are reusable by all of the engineering teams at Google.
But does it make everyone ops? At Google, we probably have access to the world’s finest infrastructure for building large-scale systems. Individual teams never have to care about a lock system, databases, or our internal naming service. The internal infrastructure is staffed to work, and it works well. On top of that, we have a very well-established SRE culture, and software engineers can think and act as SREs, just by adopting the fundamentals and the existing infrastructure, until the work grows beyond their scale. This model helps software engineers have a clear understanding of the operational aspects and gives the SRE team the opportunity to focus on impactful projects in a highly sustainable way. I think the industry needs a breakdown between product and infra engineering, and needs to start talking about how we staff infra teams and support product development teams with SRE. The “DevOps” conversation is often incomplete without this breakdown, and it tends to assume that everyone is self-serving their infra and ops all the time.
It is worth noting that Google also has a program called Mission Control that allows software engineers to switch to an SRE role for six months. This program gives software engineers a more in-depth understanding of how the SRE team operates large-scale systems. Upon finishing the program, they can take the knowledge and hands-on expertise back to their development teams.