SRE (Site Reliability Engineering)

SRE is a discipline built on principles. Instead of prescribing specific solutions, it offers best practices that help each organization decide what works best for it. Once you understand the principles, you can apply them in many areas, and you can judge any new policy or procedure against them.

All SRE principles align on one ultimate goal: customer satisfaction. By following these core tenets, you make a positive impact on customers. It’s important to maintain this focus on business value.

SRE principles vs DevOps principles

SRE and DevOps both operate based on a set of principles. Both sets drive alignment towards business goals, and some of the principles overlap. The biggest difference is that DevOps principles describe goals, while SRE principles describe processes for achieving those goals. In this sense, SRE best practices are a way of implementing DevOps principles.

7 fundamental principles of SRE

  1. Embracing Risk
  2. Service Level Objectives
  3. Eliminating Toil
  4. Monitoring
  5. Release Engineering
  6. Automation
  7. Simplicity

Increase Developmental Velocity

Google - Site Reliability Engineering

The 7 SRE Principles And How to Put Them Into Practice

SRE OKRs

Incident Response OKRs

  • Reduce MTTR for on-call engineers by 5%
  • Develop buffers to ensure incidents consume < 75% of the error budget
  • Mitigate false positive system alerts to reduce on-call staff costs
  • Speed up the resolution of critical incidents by 5%
  • Increase the coverage of 4-point SLIs from 90% of services to 100%
  • Reduce manual toil from 25% of responder time to 20%
  • Reduce incident recurrence from 8 out of 10 to 6 out of 10 incidents
  • Assure realistic SLA targets in line with current SLIs for > 97.5% of accounts
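
The MTTR and error-budget targets above come down to simple arithmetic. The following Python sketch shows one way to compute both. The `Incident` record, its field names, the 99.9% SLO, and the 30-day window are illustrative assumptions, not values taken from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    minutes_to_resolve: float  # time from detection to resolution
    downtime_minutes: float    # user-facing downtime charged to the budget


def mean_time_to_resolve(incidents: List[Incident]) -> float:
    """Average resolution time in minutes (0.0 when there are no incidents)."""
    if not incidents:
        return 0.0
    return sum(i.minutes_to_resolve for i in incidents) / len(incidents)


def error_budget_minutes(slo: float, window_minutes: float) -> float:
    """Downtime allowed by an availability SLO over the given window."""
    return (1.0 - slo) * window_minutes


def budget_consumed(incidents: List[Incident], slo: float, window_minutes: float) -> float:
    """Fraction of the error budget used up by incident downtime."""
    budget = error_budget_minutes(slo, window_minutes)
    return sum(i.downtime_minutes for i in incidents) / budget


# Assumed example: 99.9% availability SLO over a 30-day window.
WINDOW_MINUTES = 30 * 24 * 60
incidents = [Incident(30, 10), Incident(50, 12)]

mttr = mean_time_to_resolve(incidents)
consumed = budget_consumed(incidents, 0.999, WINDOW_MINUTES)
print(f"MTTR: {mttr:.1f} min, budget consumed: {consumed:.0%}")
print("within 75% target" if consumed < 0.75 else "over 75% target")
```

With these hypothetical numbers, the budget is about 43.2 minutes, the two incidents consume roughly 51% of it, and the mean time to resolve is 40 minutes, so the < 75% target would be met.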

System performance and resilience OKRs

  • Reduce 5xx errors from 1% down to 0.75%
  • Increase the share of microservices designed for failover from the current 60% to 65%
  • Reduce network latency among the top 5 services by 2.5%
  • Increase average load speed of application by 0.25%
  • Reduce open-source-software-related errors by 10%
  • Increase black swan event awareness among developers to 90%
  • Plan for handling unexpected high demand up to 25% burst capacity
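
A target such as cutting server errors from 1% to 0.75% presupposes an agreed way to measure the error rate. Here is a minimal Python sketch, assuming the rate is the share of responses whose HTTP status falls in the 500-599 range. The function name and the sample counts are hypothetical.

```python
from typing import Mapping


def server_error_rate(status_counts: Mapping[int, int]) -> float:
    """Share of responses with a 5xx status (0.0 when there is no traffic)."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(count for code, count in status_counts.items() if 500 <= code <= 599)
    return errors / total


# Hypothetical response counts per status code for one measurement window.
sample_counts = {200: 9900, 404: 25, 500: 50, 503: 25}
TARGET_RATE = 0.0075  # the 0.75% objective

rate = server_error_rate(sample_counts)
print(f"5xx rate: {rate:.2%}")
print("at or below target" if rate <= TARGET_RATE else "above target")
```

Client errors such as 404 count toward the total but not toward the error numerator here. Whether to treat them that way is a design choice each team would need to make for its own SLIs.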

Developer support OKRs

  • Drive rail-guided services from 40% to 50% of all new launches
  • Speed up time to production for images by 20%
  • Improve developer speed-to-publish by 10%
  • Increase tool efficiency to < 2 same-purpose tools per category across teams

DevSecOps OKRs

  • Reduce build security issues by 25%
  • Drive DevSecOps awareness among developers to 75% of the headcount
  • Drive security of database architecture with < 1 major incident per year

FinOps (Cloud Cost Control) OKRs

  • Reduce the cost of stateful storage capacity by 10%
  • Reduce total cloud billing by 1%
  • Reduce vendor-based tool costs by 10%
  • Reduce routine downtime maintenance costs by 3%

Work practices OKRs

  • Increase increment velocity in SRE project work with one-sprint reduction
  • Reduce operational work from 65% of total work time to 55%

25+ Site Reliability Engineering OKRs – Boost software reliability | SREpath