SRE Explained: Or, How I Learned to Stop Worrying and Love the Pager
The 3 AM Scream
We’ve all been there. It's 3 AM. Your phone buzzes with the fury of a thousand angry bees. An alert screams: SERVICE 'user-auth' IS DOWN. You roll out of bed, trip over a cat, and stumble to your laptop. Your mission: bring the service back to life before the world wakes up and notices.
This is the classic life of an Operations (Ops) person. They are the brave guardians of production, the firefighters of the digital world.
On the other side of the wall, you have Developers (Devs). Their mission is to build and ship cool new features as fast as possible. They're the race car drivers, always pushing the limits.
See the problem? The Ops team wants stability ("Please, for the love of all that is holy, don't change anything!"), while the Dev team wants velocity ("Hold my beer, I'm deploying this new AI-powered button that changes color based on your mood.").
This tension is what DevOps lore calls the 'Wall of Confusion'. And Site Reliability Engineering (SRE) is the sledgehammer that smashes it.
So, What the Heck is SRE?
The simplest definition comes from the person who coined the term at Google, Ben Treynor Sloss:
"SRE is what happens when you ask a software engineer to design an operations team."
Instead of manually fixing problems, an SRE's first instinct is to ask: "How can I write code to fix this for me forever?"
SRE treats operations as a software problem. Your infrastructure, your deployment pipeline, your monitoring—it's all just a big, distributed system that can be managed, automated, and improved with code.
| Traditional Ops | SRE (Site Reliability Engineering) |
|---|---|
| Manual Toil: Manually restarts a server. | Automation: Writes a script that detects the failure and automatically restarts the server. |
| Reactive: Fights fires as they appear. | Proactive: Analyzes why the server failed and writes code to prevent it from happening again. |
| Ticket-based: Works through a queue of tickets. | Error Budget-based: Uses data to decide when to allow changes and when to focus on stability. |
The Secret Sauce: The Holy Trinity of SRE
SRE isn't just a vibe; it's a data-driven practice. It revolves around three key concepts that will change the way you think about reliability.
1. SLI (Service Level Indicator)
An SLI is simply a measurement of something. It's a quantifiable metric of your service's performance.
- Good SLI: The percentage of HTTP requests that completed successfully.
- Good SLI: The 95th-percentile latency of requests to your API.
- Bad SLI: CPU utilization. (Why? Because high CPU isn't necessarily a problem if your users are still getting fast, successful responses. SLIs should be user-centric!)
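To make that concrete, here's a minimal sketch of computing both of those "good" SLIs from raw request data. The in-memory request log and the naive percentile calculation are purely illustrative, not how you'd do it at scale:

```python
# Hypothetical request log: (HTTP status code, latency in milliseconds)
requests_log = [(200, 45), (200, 120), (500, 30), (200, 80), (200, 950)]

def success_rate_sli(log):
    """SLI #1: percentage of requests that completed successfully (non-5xx)."""
    successes = sum(1 for status, _ in log if status < 500)
    return 100.0 * successes / len(log)

def p95_latency_sli(log):
    """SLI #2: 95th-percentile request latency, in milliseconds (naive calculation)."""
    latencies = sorted(latency for _, latency in log)
    index = max(0, int(len(latencies) * 0.95) - 1)
    return latencies[index]

print(f"Success rate: {success_rate_sli(requests_log):.2f}%")
print(f"p95 latency:  {p95_latency_sli(requests_log)} ms")
```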
2. SLO (Service Level Objective)
An SLO is the target you set for your SLI. It's your promise to your users.
- SLI: Request success rate.
- SLO: 99.9% of requests will be successful over a 30-day period.
This is a clear, unambiguous goal. We either meet it or we don't.
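A quick, hypothetical sketch of what "meeting the SLO" means in numbers: measure the SLI over the window and compare it to the target.

```python
SLO_TARGET = 99.9            # percent, over a 30-day window
total_requests = 1_000_000   # hypothetical traffic for the month
failed_requests = 850        # hypothetical failures in that window

measured_sli = 100.0 * (total_requests - failed_requests) / total_requests
print(f"Measured SLI: {measured_sli:.3f}%")   # 99.915%
print("SLO met!" if measured_sli >= SLO_TARGET else "SLO missed.")
```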
3. Error Budget
This is the magic part. The part that breaks down the wall between Dev and Ops.
An Error Budget is simply 100% - SLO.
If your SLO is 99.9% availability, your Error Budget is 0.1%. This 0.1% is the acceptable amount of "unreliability" you are allowed to have over a period (e.g., a month).
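In concrete terms, a 99.9% availability SLO over 30 days buys you roughly 43 minutes of total downtime. A quick back-of-the-envelope sketch:

```python
SLO = 99.9          # percent
WINDOW_DAYS = 30

error_budget_fraction = (100 - SLO) / 100       # 0.001, i.e. 0.1%
window_minutes = WINDOW_DAYS * 24 * 60          # 43,200 minutes in the window
budget_minutes = window_minutes * error_budget_fraction

print(f"Error budget: {error_budget_fraction:.1%} of the window")
print(f"That's about {budget_minutes:.1f} minutes of downtime per month")  # ~43.2
```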
How does this solve anything?
The Error Budget becomes the single, data-driven arbiter for deciding between shipping new features and focusing on reliability.
- Got plenty of Error Budget left? Great! The Dev team can take risks. Ship that new feature! Try that experimental database! If it causes a few errors, it's okay—we've budgeted for it.
- Is the Error Budget almost gone? Freeze all new releases! It's all hands on deck for the Dev and SRE teams to work on stability, fix bugs, and improve performance until the service is reliable again.
Suddenly, Devs and SREs are on the same team with a shared goal. Devs want to ship features, so they have a vested interest in not burning the error budget. SREs are happy to let them ship as long as the budget allows.
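Here's a minimal sketch of what that arbitration could look like as code. The function name and thresholds are made up for illustration; real error-budget policies are agreed on by the teams, not hard-coded by one script:

```python
def release_decision(budget_remaining_pct):
    """Decide whether to ship based on how much of the monthly error budget is left."""
    if budget_remaining_pct > 50:
        return "Ship it! Plenty of budget left to absorb risk."
    elif budget_remaining_pct > 10:
        return "Ship carefully: small changes only, watch the dashboards."
    else:
        return "Release freeze: all hands on reliability work."

print(release_decision(80))   # plenty of budget left
print(release_decision(5))    # budget nearly burned
```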
Let's See Some Code: The SRE Mindset in Action
Imagine you keep getting paged because a service's memory usage spikes, and it needs a manual restart.
The Traditional Ops Approach:
- Get paged at 3 AM.
- SSH into the server.
- Run `sudo systemctl restart my-flaky-app`.
- Go back to bed, knowing it will happen again.
The SRE Approach:
Step 1: Measure (Define an SLI)
First, let's make sure we're measuring the user impact, not just the server's memory. The real problem is that when memory spikes, the app's API starts throwing 500 errors. So our SLI is the error rate.
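Of course, the error rate only exists as an SLI if the app actually exports it. Here's a minimal, hypothetical sketch using the prometheus_client library: the counter is exposed to Prometheus as `http_requests_total` (the client appends `_total` for counters), matching the query the remediation bot uses below; the `job` label is attached by Prometheus at scrape time.

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Exposed as http_requests_total{code="..."} when Prometheus scrapes /metrics
REQUESTS = Counter("http_requests", "HTTP requests handled", ["code"])

def handle_request():
    """Stand-in for a real request handler: records success or failure."""
    if random.random() < 0.02:           # simulate the occasional 500
        REQUESTS.labels(code="500").inc()
    else:
        REQUESTS.labels(code="200").inc()

if __name__ == "__main__":
    start_http_server(8000)              # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(0.1)
```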
Step 2: Automate the Fix (But Make It Smart)
Instead of just restarting by hand, let's write a small Python script that acts as a 'remediation bot'. This could be part of a larger system like a Kubernetes Operator, or just a standalone script run by a cron job.
```python
import requests
import os

PROMETHEUS_URL = "http://prometheus.local:9090/api/v1/query"
SERVICE_NAME = "my-flaky-app"
ERROR_THRESHOLD = 5  # Percentage of errors over the last 5 minutes

def get_error_rate():
    """Queries Prometheus for the 5-minute error rate of our service."""
    # This PromQL query calculates the percentage of 5xx responses
    query = (
        f'sum(rate(http_requests_total{{job="{SERVICE_NAME}", code=~"5.."}}[5m])) / '
        f'sum(rate(http_requests_total{{job="{SERVICE_NAME}"}}[5m])) * 100'
    )
    try:
        response = requests.get(PROMETHEUS_URL, params={'query': query})
        response.raise_for_status()
        result = response.json()['data']['result']
        if result:
            error_percentage = float(result[0]['value'][1])
            print(f"Current error rate: {error_percentage:.2f}%")
            return error_percentage
    except Exception as e:
        print(f"Error querying Prometheus: {e}")
        return 0  # Fail safe
    return 0

def restart_service():
    """Safely restarts the service. In real life, this would use an API (e.g., the Kubernetes API)."""
    print("Error threshold breached! Restarting service...")
    # In a real K8s world: os.system(f'kubectl rollout restart deployment/{SERVICE_NAME}')
    os.system(f'echo "Restarting {SERVICE_NAME}" > /tmp/restart.log')  # Dummy command for example
    print("Service restart initiated.")

if __name__ == "__main__":
    error_rate = get_error_rate()
    if error_rate > ERROR_THRESHOLD:
        # We are burning our error budget too fast!
        restart_service()
    else:
        print("Service is healthy. No action needed.")
```
Step 3: Eliminate the Problem
This script stops the 3 AM pages, but it doesn't solve the root cause. The SRE's job isn't done. Now they have the breathing room (and the data from Prometheus) to work with developers to find and fix the memory leak, eliminating the problem for good.
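If the flaky service happens to be Python, the built-in tracemalloc module is one place to start the leak hunt. A minimal sketch (the leaky function is deliberately contrived):

```python
import tracemalloc

def leaky_cache(cache=[]):           # mutable default argument: a classic accidental leak
    cache.append("x" * 10_000)       # grows forever, nothing ever evicts it

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(1_000):
    leaky_cache()

after = tracemalloc.take_snapshot()
# Show which lines of code grew the most between the two snapshots
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```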
What Problems Does SRE Solve?
- It ends the Dev vs. Ops war: Creates a shared language and common goals (the Error Budget).
- It makes reliability a feature: Treats stability and performance as first-class features of the product, not an afterthought.
- It reduces 'toil': Aggressively automates repetitive, manual tasks, freeing up engineers to work on high-impact projects.
- It enables you to move faster, more safely: By quantifying risk with Error Budgets, you can make informed decisions about when to push forward and when to hold back.
So next time your pager goes off, don't just fix the problem. Ask yourself: "How can I write code so this never, ever wakes me up again?"
That, my friend, is the SRE way.