Keeping Your Promises: A Fun Guide to SLIs, SLOs, and SLAs
So, Your App is Down... Again.
We’ve all been there. It’s 3 AM, your phone is buzzing like an angry hornet, and the only thing you can think is, "Oh no, what did I break this time?" Users are complaining on Twitter that your service is "slow," "broken," or "a potato-powered hamster wheel."
But what does "slow" even mean? How much "broken" is too much? How do you turn vague user rage into actionable engineering goals?
Fear not, my friend! Enter the three musketeers of reliability: SLIs, SLOs, and SLAs. They sound like a law firm, but I promise they're way more fun. To make this easy, let's imagine we're not running a web service, but the world's most critical enterprise: a pizza delivery service.
SLI: The Thermometer (What are we measuring?)
An SLI, or Service Level Indicator, is simply a measurement. It’s a number. It’s the raw, unfiltered truth about your service's performance. It doesn't have an opinion; it just states facts.
For our pizza joint, an SLI could be:
- Pizza Delivery Time: How long did it take from order to doorstep? (e.g., 27 minutes)
- Pizza Temperature: Was the pizza hot on arrival? (e.g., 150°F)
- Order Accuracy: Did the customer get the pepperoni they asked for, or did they get pineapple and despair? (e.g., 99.8% of orders are correct)
In the tech world, common SLIs are:
- Availability: Is the service responding to requests? (e.g., the percentage of successful HTTP responses, i.e., not 5xx errors).
- Latency: How long does it take to get a response? (e.g., the time to process an API request).
- Throughput: How many requests can we handle per second?
An SLI is something you can measure. If you can't put a number on it, it's not an SLI.
Let's look at a simple code example. Imagine you have a log of your API's HTTP status codes. You can calculate your availability SLI like this:
```python
def calculate_availability_sli(status_codes):
    """Calculates the percentage of successful requests (not 5xx)."""
    total_requests = len(status_codes)
    if total_requests == 0:
        return 100.0  # No requests, perfect availability? Let's say yes!
    successful_requests = sum(1 for code in status_codes if code < 500)
    availability_percentage = (successful_requests / total_requests) * 100
    return availability_percentage

# A log of requests from the last 5 minutes
requests_log = [200, 200, 503, 201, 404, 200, 500, 200]
availability = calculate_availability_sli(requests_log)
print(f"Our availability SLI is currently: {availability:.2f}%")
# Output: Our availability SLI is currently: 75.00%
```
See? The SLI is just a number: 75.00%. It doesn't say if that's good or bad. It's just... a fact.
SLO: The Promise to Ourselves (What's our goal?)
An SLO, or Service Level Objective, is the goal you set for your SLI. It's where you draw a line in the sand and say, "This is what we consider a good performance."
This is an internal target. It's a promise you make to your team to keep yourselves honest and your users happy.
Back to our pizza shop:
- SLO for Delivery Time: We will deliver 99% of our pizzas in under 30 minutes.
Notice the specifics. It's not just "we'll be fast." It's a precise target (99%) for an SLI (delivery time) over a certain threshold (< 30 mins).
In tech:
- SLO for Availability: The calculate_availability_sli function will return at least 99.9% over a rolling 28-day window.
- SLO for Latency: 95% of all API requests to the /users endpoint will be served in under 200ms.
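A latency SLO like this can be checked mechanically: count the fraction of requests that came in under the threshold and compare it to the target. Here's a minimal sketch; the function name, sample latencies, and window are made up for illustration:

```python
def latency_slo_met(latencies_ms, threshold_ms=200.0, target_fraction=0.95):
    """Return True if at least target_fraction of requests beat the threshold."""
    if not latencies_ms:
        return True  # no traffic, no violations
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return (fast / len(latencies_ms)) >= target_fraction

# Five minutes' worth of hypothetical request latencies, in milliseconds
sample = [120, 90, 450, 180, 110, 95, 210, 100, 130, 85]
print(latency_slo_met(sample))  # 8 of 10 under 200ms -> 0.80 < 0.95 -> False
```

Note that in practice you'd evaluate this over the SLO's full rolling window, not a five-minute sample.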
The Magic of the Error Budget
SLOs give us something amazing: an Error Budget. If your availability SLO is 99.9%, it means you have a 0.1% budget for failure.
Error Budget = 100% - SLO
This is liberating! It means you don't have to be perfect. You can use this budget to take calculated risks, like deploying new features or performing risky maintenance. As long as you're within budget, you can innovate. If you burn through your budget, all hands on deck! It's time for a feature freeze to focus on stability.
```python
# Let's check if we are meeting our SLO
SLO_TARGET = 99.9
current_availability_sli = 99.95  # We are doing great!

error_budget_remaining = current_availability_sli - SLO_TARGET

if error_budget_remaining >= 0:
    print(f"Hooray! We are meeting our SLO. We have {error_budget_remaining:.2f}% of budget left. Let's ship that new feature!")
else:
    print(f"ALERT! We have burned our error budget by {-error_budget_remaining:.2f}%. Freeze all deploys! Fix the things!")
```
SLA: The Pinky Promise with Consequences (What if we fail?)
An SLA, or Service Level Agreement, is the promise you make to your customers. It's a formal contract, often with financial penalties if you fail to meet it.
An SLA is basically a simplified, more lenient version of your SLO. You give yourself a buffer because breaking an SLA costs you real money or reputation.
For our pizza empire:
- SLA: If your pizza is not delivered in 45 minutes, you get it for free.
See the difference? Our internal goal (SLO) is 30 minutes. But the external promise (SLA) is 45 minutes. This buffer zone protects us. We might miss our internal goal and have a team meeting about it, but we hopefully won't miss the customer-facing promise and have to give away free pizza.
In the tech world:
- SLA: We guarantee 99.5% uptime per billing cycle. If we fail, customers will receive a 10% credit on their next bill.
Notice that the SLA uptime (99.5%) is lower than our internal SLO target (99.9%). This is crucial. Your SLO should always be stricter than your SLA.
SLO > SLA. Always.
If you start breaking your SLO, it's an early warning sign that your SLA might be in danger. It's the canary in the coal mine.
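That "canary" logic is easy to encode. A minimal sketch using the 99.9% SLO and 99.5% SLA figures from above (the function name and status labels are invented for illustration):

```python
SLO_TARGET = 99.9   # internal goal
SLA_TARGET = 99.5   # customer-facing guarantee; must be looser than the SLO

def reliability_status(measured_uptime):
    """Classify where we stand relative to our internal and external promises."""
    if measured_uptime >= SLO_TARGET:
        return "healthy"      # meeting the internal goal
    if measured_uptime >= SLA_TARGET:
        return "slo-breach"   # the canary: fix things before the SLA is at risk
    return "sla-breach"       # contract violated: time to issue credits

print(reliability_status(99.95))  # healthy
print(reliability_status(99.7))   # slo-breach: early warning, no money lost yet
print(reliability_status(99.2))   # sla-breach: customers get their 10% credit
```

The middle branch is the whole point of the buffer: it gives you a state where the team is unhappy but the customers aren't.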
Tying It All Together
Let's recap the relationship with our pizza analogy:
- You measure everything (SLI): The delivery time for every single pizza is logged. This is your data.
- You set an internal goal (SLO): To keep the quality high and the team sharp, you aim to have 99% of deliveries under 30 minutes.
- You make a public promise (SLA): You tell your customers, "Guaranteed in 45 minutes or it's free!" This manages their expectations and defines the consequences of failure.
SLI (the measurement) informs your SLO (the internal goal), which in turn informs your SLA (the external promise).
So, the next time your service feels "slow," don't just panic. Ask the right questions:
- What is the SLI we should be looking at? (e.g., p95 latency).
- What is our SLO for that indicator? (e.g., 200ms).
- Are we about to breach our SLA? (e.g., the contract promises < 1000ms).
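For that first question, p95 latency just means: sort the observed latencies and take the value 95% of the way up, so 5% of requests were slower. A minimal sketch using the nearest-rank method; the sample numbers are illustrative:

```python
import math

def p95_latency(latencies_ms):
    """Compute the 95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies = [110, 120, 95, 480, 130, 105, 90, 150, 220, 100]
p95 = p95_latency(latencies)
print(f"p95 latency: {p95}ms")
print(p95 < 200)    # SLO (200ms) met?
print(p95 < 1000)   # SLA (1000ms) still safe?
```

With only ten samples, p95 lands on the slowest request; real monitoring systems compute this over thousands of requests, where it's far more stable.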
By using this framework, you transform chaos into a structured, data-driven conversation about reliability. And that means less 3 AM panic-calls and more happy, pizza-eating customers. Or, you know, users of your app.