What is Site Reliability Engineering?

Site reliability engineering, or SRE, is a relatively new field that has seen tremendous growth over the past few years. As companies embrace the cloud, develop innovative products, and become more reliant on cloud-native technologies for their operations, they are looking for software developers and engineers who can help them design and maintain systems that will continue to work properly through any situation.

In this article, we’re going to explain site reliability engineering and what implementing SRE means for your organization. This should help you decide if you should hire your own site reliability engineering team.

What is Site Reliability Engineering?

SRE practices originated at Google in the early 2000s, well before the DevOps movement took shape. The idea was to more closely unite the methodologies of operations teams and software engineers.

A team of engineers was asked to make Google’s already large-scale sites more efficient, scalable, and, most importantly, reliable. The SRE principles and practices they developed responded so well to Google’s needs that other big tech companies, such as Netflix and Amazon, have adopted and expanded them.

“Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics,” explains Google in their book, Site Reliability Engineering: How Google Runs Production Systems, by Niall Richard Murphy, Betsy Beyer, Chris Jones, and Jennifer Petoff. In the book, Google discusses big concepts like service-level objectives and error budgets. They describe their practices around automation, troubleshooting, monitoring, managing risk, building scalable systems, and responding to emergencies. (The book is available to read for free online, in case you’d like to check it out.)

Eventually, SRE became its own engineering domain, aimed at creating automated solutions for operations, like performance and capacity planning, on-call monitoring, and disaster response postmortems.

“But I already do DevOps,” you might be thinking. “Why do I need SRE as well?” Similar to the way that Scrum is one way to implement an Agile methodology, SRE doesn’t replace DevOps, but complements traditional DevOps practices like infrastructure automation and continuous delivery. This in turn removes organizational silos and keeps data centers and software systems available.

DevOps is how you approach design, automation, and culture to deliver rapid, high-quality service. SRE, however, is how you implement DevOps. Whereas DevOps focuses on moving through the development pipeline quickly and efficiently, SRE focuses on balancing site reliability with new features.

When you put a few people in charge of SRE, you get the following benefits:

Improved service quality and reliability.
Reduced IT time per application developed.
More developer focus on the development pipeline instead of operations tasks.
Greater visibility into service health, gained by tracking the performance of all services and identifying the causes of incidents.
A quantified cost of downtime, based on how reliability affects sales, customer service, marketing, and other functions.
Less tedious manual work, because automation detects and responds to operational faults without human intervention.
Better incident management, with fewer, shorter, and less damaging outages.

Role of a Site Reliability Engineer

Site reliability engineer is a unique role. It requires a background in software development combined with operations or system administration experience, or an IT operations background with solid software development skills. These engineers are the bridge between development and operations. They split their time between operations tasks (like their on-call duties) and developing systems that increase the platform’s reliability.

SRE teams are responsible for how code is configured, deployed, and monitored. They keep tabs on the availability, latency, change management, emergency response, and capacity management of all of the organization’s services.

According to Ben Treynor Sloss, VP of engineering at Google and founder of Google SRE, “SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.” In short, the SRE team also owns performance, monitoring, and efficiency alongside the responsibilities listed above.

Google recommends limiting SREs to spending no more than 50% of their time on operational tasks. Excess operational work and poorly performing services should be sent back to the development team to handle instead of SRE spending too much time on the operations of a specific application or service.

In a nutshell, the SRE team assists the DevOps team in several important ways:

Establish service level indicators, or SLIs: metrics such as latency or availability that measure the level of service the system actually provides.
Establish service level objectives, or SLOs: target values or ranges for those SLIs that the team can agree on.
Create error budgets, which represent the maximum amount of underperforming or failing time a system can experience without missing its SLO or violating the terms of the service-level agreement (SLA) that covers it. The error budget is more than a metric, because it gives the SRE and development teams an objective way to reconcile service reliability with the pace of innovation.
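
To make the relationship between these three terms concrete, here is a minimal sketch in Python. It assumes an availability SLI defined as the fraction of successful requests; the names `SloReport` and `evaluate_availability` and the request counts are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo_target: float        # agreed target for that SLI
    allowed_failures: float  # error budget, expressed in requests
    budget_remaining: float  # fraction of the budget still unspent (negative if overspent)

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo_target


def evaluate_availability(good_requests: int, total_requests: int,
                          slo_target: float) -> SloReport:
    """Compare a measured availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_requests / total_requests
    # The error budget is whatever the SLO leaves over: (1 - target) of all requests.
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    budget_remaining = 1 - actual_failures / allowed_failures
    return SloReport(sli, slo_target, allowed_failures, budget_remaining)


# Hypothetical traffic: one million requests, 500 of which failed, against a 99.9% SLO.
report = evaluate_availability(good_requests=999_500,
                               total_requests=1_000_000,
                               slo_target=0.999)
print(f"SLI={report.sli:.4%}  SLO met={report.slo_met}  "
      f"budget left={report.budget_remaining:.0%}")
```

In this example the service used about half of its error budget, so the SLO is still met and there is room left for riskier changes.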

Ultimately, the goal of an SRE team is to automate themselves out of a job by creating systems that handle operational tasks without direct oversight. They also build self-service tools for users who rely on their services. For instance, they might design a tool that lets developers create their own testing environments. These kinds of tools mean less work for everyone.

Furthermore, SREs work closely with product developers to make sure their features meet non-functional requirements (like performance, security, maintainability, and availability) and to make the CI/CD pipeline as efficient as possible.

100% Reliability?

Site reliability engineering teams understand that 100% reliability is not a realistic target. Some failures are inevitable. Their goal is to plan for failure, minimize the damage when it happens, and get the service up and running again quickly.

SRE teams create error budgets for the development teams. These budgets reconcile a company’s service reliability with its pace of development and its agreements with other organizations it works with.

The developers must build the new feature or system in a manner that doesn’t overspend that budget. If a service runs within the budget, the developers can release it whenever they like. But if the system overspends the budget (by producing too many errors or crashing for too long), they can’t release it until they bring those numbers down.
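
The release rule described above can be sketched as a simple gate. This is a minimal illustration, not a standard API: the function name `can_release`, its parameters, and the numbers are hypothetical, and real teams usually capture such rules in a written error budget policy with more nuance.

```python
def can_release(budget_minutes: float,
                downtime_used_minutes: float,
                expected_extra_downtime_minutes: float = 0.0) -> bool:
    """Return True if the service would stay within its error budget after the release."""
    for value in (budget_minutes, downtime_used_minutes,
                  expected_extra_downtime_minutes):
        if value < 0:
            raise ValueError("all durations must be non-negative")
    projected = downtime_used_minutes + expected_extra_downtime_minutes
    return projected <= budget_minutes


# Hypothetical numbers: a 10-minute budget with 3 minutes already consumed.
print(can_release(10.0, 3.0, expected_extra_downtime_minutes=2.0))  # stays within budget
print(can_release(10.0, 3.0, expected_extra_downtime_minutes=9.0))  # would overspend
```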

For example, suppose a company’s service-level agreement (SLA) promises 99.99% uptime. (That’s a common target for high-availability services.) Over a full year, this allows roughly 52 and a half minutes of downtime without contractual consequences, which works out to about four minutes and 23 seconds in an average month.

Now let's say the development team wants to release a new feature, but that new feature is expected to create eight minutes of downtime per month. Since this exceeds the monthly error budget, the site reliability team would ask the product team to make improvements.
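
The arithmetic behind a downtime budget is straightforward: multiply the length of the period by the fraction of time the service is allowed to be unavailable. Below is a minimal sketch; the helper name `downtime_budget` is hypothetical, and the year is assumed to be 365.25 days with a month taken as one twelfth of that.

```python
from datetime import timedelta


def downtime_budget(availability_target: float, period: timedelta) -> timedelta:
    """Allowed downtime in `period` for a given availability target (e.g. 0.9999)."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be in (0, 1]")
    return period * (1 - availability_target)


YEAR = timedelta(days=365.25)   # assumption: average year length
MONTH = YEAR / 12               # assumption: average month

for label, period in (("year", YEAR), ("month", MONTH)):
    budget = downtime_budget(0.9999, period)
    minutes, seconds = divmod(round(budget.total_seconds()), 60)
    print(f"99.99% over one {label}: about {minutes} min {seconds} s")
```

Under these assumptions, a 99.99% target leaves a little under 53 minutes per year, or a little under four and a half minutes per month.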

Ultimately, this workflow helps development teams and operations teams make data-driven decisions about feature deployment, improve the stability and performance of services, and maximize innovation by taking calculated risks. It also ensures that features, applications, and systems only reach the user once they meet the organization’s reliability standards and the standards set forth in service-level agreements.

Do You Need Site Reliability Engineering?

Now you’re probably wondering if you need to add some site reliability engineers to your team. In most cases, you only need this kind of engineer in-house if you’re running a large-scale system that continuously deploys new code. This mostly applies to cloud-based SaaS companies that need near-perfect uptime.

If you outsource the development of your product and/or your DevOps, ask your vendor if they offer site reliability engineering. They may not have dedicated engineers who handle it for you, but they probably manage many SRE tasks in their normal service offering.

If you don’t produce any of your own software in-house, you obviously don’t need a site reliability engineering team. But when you work with an outside development shop, it’s smart to ask about their SRE capabilities.

If you’d like to add site reliability engineering to your team, reach out to us about your needs. We can supplement your team with people who handle SRE tasks, like monitoring, managing infrastructure scalability, and guiding your security. Contact us today.

Dennis