Keeping Your Promises: A Fun Guide to SLIs, SLOs, and SLAs
So, Your App is Down... Again.
We’ve all been there. It’s 3 AM, your phone is buzzing like an angry hornet, and the only thing you can think is, "Oh no, what did I break this time?" Users are complaining on Twitter that your service is "slow," "broken," or "a potato-powered hamster wheel."
But what does "slow" even mean? How much "broken" is too much? How do you turn vague user rage into actionable engineering goals?
Fear not, my friend! Enter the three musketeers of reliability: SLIs, SLOs, and SLAs. They sound like a law firm, but I promise they're way more fun. To make this easy, let's imagine we're not running a web service, but the world's most critical enterprise: a pizza delivery service.
SLI: The Thermometer (What are we measuring?)
An SLI, or Service Level Indicator, is simply a measurement. It’s a number. It’s the raw, unfiltered truth about your service's performance. It doesn't have an opinion; it just states facts.
For our pizza joint, an SLI could be:
- Pizza Delivery Time: How long did it take from order to doorstep? (e.g., 27 minutes)
- Pizza Temperature: Was the pizza hot on arrival? (e.g., 150°F)
- Order Accuracy: Did the customer get the pepperoni they asked for, or did they get pineapple and despair? (e.g., 99.8% of orders are correct)
In the tech world, common SLIs are:
- Availability: Is the service responding to requests? (e.g., the percentage of successful HTTP responses, i.e., not 5xx errors).
- Latency: How long does it take to get a response? (e.g., the time to process an API request).
- Throughput: How many requests can we handle per second?
An SLI is something you can measure. If you can't put a number on it, it's not an SLI.
Let's look at a simple code example. Imagine you have a log of your API's HTTP status codes. You can calculate your availability SLI like this:
```python
def calculate_availability_sli(status_codes):
    """Calculates the percentage of successful requests (not 5xx)."""
    total_requests = len(status_codes)
    if total_requests == 0:
        return 100.0  # No requests, perfect availability? Let's say yes!
    successful_requests = sum(1 for code in status_codes if code < 500)
    availability_percentage = (successful_requests / total_requests) * 100
    return availability_percentage

# A log of requests from the last 5 minutes
requests_log = [200, 200, 503, 201, 404, 200, 500, 200]
availability = calculate_availability_sli(requests_log)
print(f"Our availability SLI is currently: {availability:.2f}%")
# Output: Our availability SLI is currently: 75.00%
```
See? The SLI is just a number: 75.00%. It doesn't say if that's good or bad. It's just... a fact.
SLO: The Promise to Ourselves (What's our goal?)
An SLO, or Service Level Objective, is the goal you set for your SLI. It's where you draw a line in the sand and say, "This is what we consider a good performance."
This is an internal target. It's a promise you make to your team to keep yourselves honest and your users happy.
Back to our pizza shop:
- SLO for Delivery Time: We will deliver 99% of our pizzas in under 30 minutes.
Notice the specifics. It's not just "we'll be fast." It's a precise target (99%) for an SLI (delivery time) over a certain threshold (< 30 mins).
In tech:
- SLO for Availability: The calculate_availability_sli function will return at least 99.9% over a rolling 28-day window.
- SLO for Latency: 95% of all API requests to the /users endpoint will be served in under 200ms.
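A latency SLO like this can be checked mechanically: count the fraction of requests that came in under the threshold and compare it to the target. Here's a minimal sketch; the function name, sample latencies, and window are made up for illustration:

```python
def latency_slo_met(latencies_ms, threshold_ms=200.0, target_fraction=0.95):
    """Return True if at least target_fraction of requests beat the threshold."""
    if not latencies_ms:
        return True  # no traffic, no violations
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return (fast / len(latencies_ms)) >= target_fraction

# Five minutes' worth of hypothetical request latencies, in milliseconds
sample = [120, 90, 450, 180, 110, 95, 210, 100, 130, 85]
print(latency_slo_met(sample))  # 8 of 10 under 200ms -> 0.80 < 0.95 -> False
```

Note that in practice you'd evaluate this over the SLO's full rolling window, not a five-minute sample.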
The Magic of the Error Budget
SLOs give us something amazing: an Error Budget. If your availability SLO is 99.9%, it means you have a 0.1% budget for failure.
Error Budget = 100% - SLO
This is liberating! It means you don't have to be perfect. You can use this budget to take calculated risks, like deploying new features or performing risky maintenance. As long as you're within budget, you can innovate. If you burn through your budget, all hands on deck! It's time for a feature freeze to focus on stability.
```python
# Let's check if we are meeting our SLO
SLO_TARGET = 99.9
current_availability_sli = 99.95  # We are doing great!

error_budget_remaining = current_availability_sli - SLO_TARGET

if error_budget_remaining >= 0:
    print(f"Hooray! We are meeting our SLO. We have {error_budget_remaining:.2f}% of budget left. Let's ship that new feature!")
else:
    print(f"ALERT! We have burned our error budget by {-error_budget_remaining:.2f}%. Freeze all deploys! Fix the things!")
```
SLA: The Pinky Promise with Consequences (What if we fail?)
An SLA, or Service Level Agreement, is the promise you make to your customers. It's a formal contract, often with financial penalties if you fail to meet it.
An SLA is basically a simplified, more lenient version of your SLO. You give yourself a buffer because breaking an SLA costs you real money or reputation.
For our pizza empire:
- SLA: If your pizza is not delivered in 45 minutes, you get it for free.
See the difference? Our internal goal (SLO) is 30 minutes. But the external promise (SLA) is 45 minutes. This buffer zone protects us. We might miss our internal goal and have a team meeting about it, but we hopefully won't miss the customer-facing promise and have to give away free pizza.
In the tech world:
- SLA: We guarantee 99.5% uptime per billing cycle. If we fail, customers will receive a 10% credit on their next bill.
Notice that the SLA uptime (99.5%) is lower than our internal SLO target (99.9%). This is crucial. Your SLO should always be stricter than your SLA.
SLO > SLA. Always.
If you start breaking your SLO, it's an early warning sign that your SLA might be in danger. It's the canary in the coal mine.
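That "canary" logic is easy to encode. A minimal sketch using the 99.9% SLO and 99.5% SLA figures from above (the function name and status labels are invented for illustration):

```python
SLO_TARGET = 99.9   # internal goal
SLA_TARGET = 99.5   # customer-facing guarantee; must be looser than the SLO

def reliability_status(measured_uptime):
    """Classify where we stand relative to our internal and external promises."""
    if measured_uptime >= SLO_TARGET:
        return "healthy"      # meeting the internal goal
    if measured_uptime >= SLA_TARGET:
        return "slo-breach"   # the canary: fix things before the SLA is at risk
    return "sla-breach"       # contract violated: time to issue credits

print(reliability_status(99.95))  # healthy
print(reliability_status(99.7))   # slo-breach: early warning, no money lost yet
print(reliability_status(99.2))   # sla-breach: customers get their 10% credit
```

The middle branch is the whole point of the buffer: it gives you a state where the team is unhappy but the customers aren't.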
Tying It All Together
Let's recap the relationship with our pizza analogy:
- You measure everything (SLI): The delivery time for every single pizza is logged. This is your data.
- You set an internal goal (SLO): To keep the quality high and the team sharp, you aim to have 99% of deliveries under 30 minutes.
- You make a public promise (SLA): You tell your customers, "Guaranteed in 45 minutes or it's free!" This manages their expectations and defines the consequences of failure.
SLI (the measurement) informs your SLO (the internal goal), which in turn informs your SLA (the external promise).
So, the next time your service feels "slow," don't just panic. Ask the right questions:
- What is the SLI we should be looking at? (e.g., p95 latency).
- What is our SLO for that indicator? (e.g., 200ms).
- Are we about to breach our SLA? (e.g., the contract promises < 1000ms).
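For that first question, p95 latency just means: sort the observed latencies and take the value 95% of the way up, so 5% of requests were slower. A minimal sketch using the nearest-rank method; the sample numbers are illustrative:

```python
import math

def p95_latency(latencies_ms):
    """Compute the 95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies = [110, 120, 95, 480, 130, 105, 90, 150, 220, 100]
p95 = p95_latency(latencies)
print(f"p95 latency: {p95}ms")
print(p95 < 200)    # SLO (200ms) met?
print(p95 < 1000)   # SLA (1000ms) still safe?
```

With only ten samples, p95 lands on the slowest request; real monitoring systems compute this over thousands of requests, where it's far more stable.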
By using this framework, you transform chaos into a structured, data-driven conversation about reliability. And that means less 3 AM panic-calls and more happy, pizza-eating customers. Or, you know, users of your app.