Why Reliability Needs a Common Language
When something goes wrong in production, vague statements like "the system was slow" or "there were some errors" don't help teams prioritize, communicate with stakeholders, or make principled engineering trade-offs. Service Level Indicators, Objectives, and Agreements give teams a shared, precise vocabulary for talking about reliability.
These three concepts are core to Site Reliability Engineering (SRE) and are increasingly adopted by DevOps and platform teams as systems mature.
SLI: Service Level Indicator
An SLI is a specific, quantitative measure of some aspect of your system's behavior. It's a metric — a number you can actually measure.
Common SLIs include:
- Availability: the proportion of requests the service serves successfully (e.g., HTTP 2xx responses as a percentage of all requests)
- Latency: the proportion of requests served within a defined time threshold (e.g., 95th percentile response time under 300ms)
- Error rate: the percentage of requests that result in errors
- Throughput: the number of requests processed per second
- Freshness: for data pipelines, how up-to-date the data is
Choose SLIs that genuinely reflect the user experience. An SLI that looks green while users are suffering is worse than having no SLI at all.
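The request-based SLIs above can be sketched in a few lines. This is a minimal illustration, not a production metrics pipeline; the record fields (`status`, `latency_ms`) are assumed names, not a standard schema.

```python
def availability_sli(requests):
    """Fraction of requests that succeeded (here: any non-5xx status)."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r["status"] < 500)
    return good / len(requests)

def latency_sli(requests, threshold_ms=300):
    """Fraction of requests served under the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r["latency_ms"] < threshold_ms)
    return fast / len(requests)

# Illustrative request log
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 450},
    {"status": 503, "latency_ms": 30},
    {"status": 200, "latency_ms": 90},
]
print(availability_sli(requests))  # 3/4 = 0.75
print(latency_sli(requests))       # 3/4 = 0.75
```

In practice these counters come from your metrics system (load balancer logs, a Prometheus counter pair, etc.), but the shape of the computation is the same: good events divided by total events.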
SLO: Service Level Objective
An SLO is a target value or range for an SLI. It expresses what "good enough" looks like over a specified time window.
Examples:
- "99.5% of requests will be served successfully over a rolling 30-day window."
- "95% of API responses will complete in under 200ms."
- "The data pipeline will have no more than 4 hours of staleness at any time."
SLOs are internal targets that drive engineering decisions. They define when you have budget to ship new features versus when reliability work must take priority.
The Error Budget
The gap between 100% and your SLO target is your error budget. A 99.5% availability SLO gives you a 0.5% error budget — roughly 3.6 hours of allowable downtime in a 30-day window.
Error budgets are the mechanism that makes SLOs actionable:
- If your error budget is healthy (plenty remaining), you can move fast and take more deployment risk
- If your error budget is nearly exhausted, reliability work takes precedence over new features
- This transforms reliability from a debate into a data-driven conversation
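The budget arithmetic from above is simple enough to sketch directly. Numbers are illustrative; a 99.5% SLO over 30 days yields about 3.6 hours of allowable downtime.

```python
def error_budget_hours(slo, window_days=30):
    """Total allowable 'bad' time for a time-based SLO over the window."""
    return (1 - slo) * window_days * 24

def budget_remaining(slo, downtime_hours, window_days=30):
    """Fraction of the error budget still unspent (0.0 = exhausted)."""
    total = error_budget_hours(slo, window_days)
    return max(0.0, 1 - downtime_hours / total)

print(error_budget_hours(0.995))      # about 3.6 hours in 30 days
print(budget_remaining(0.995, 0.9))   # 0.9h of downtime leaves ~75% of budget
```

A team might gate risky deployments on `budget_remaining` crossing some threshold, say 0.25, which is exactly the "data-driven conversation" the bullet points describe.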
SLA: Service Level Agreement
An SLA is a contractual commitment, typically made to external customers, with defined consequences if the commitment is breached. SLAs are usually more lenient than internal SLOs — if you miss your SLO, you investigate. If you miss your SLA, you may owe refunds or face contract penalties.
| | SLI | SLO | SLA |
|---|---|---|---|
| What it is | A metric | A target for a metric | A contract with consequences |
| Audience | Internal engineering | Internal teams | External customers |
| Consequence of breach | None (it's data) | Engineering priority shift | Legal/financial penalty |
| Who sets it | Engineers | Engineering + Product | Business + Legal |
How to Define Good SLOs
- Start with user journeys. What are the critical things your users do? Define SLOs around those flows, not arbitrary technical metrics.
- Use real data. Look at your historical performance before setting a target. An SLO you've never actually met puts you in violation from day one.
- Set aspirational but achievable targets. 99.99% is not the right target for most services — it leaves almost no error budget and makes every small incident feel catastrophic.
- Define the measurement window. Rolling 30-day windows are most common. They're continuous and reflect recent performance rather than calendar-based resets.
- Review SLOs regularly. As systems and user expectations evolve, your SLOs should too.
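One way to apply the "use real data" advice is to derive a starting target from historical daily SLI values, for example the worst day you observed, so the initial SLO is one you have historically met. This is a sketch with made-up numbers, not a prescribed methodology.

```python
def suggest_slo(daily_slis):
    """Conservative starting target: the lowest daily SLI observed,
    i.e., a level the service has historically met every day."""
    return min(daily_slis)

# Illustrative history of daily availability SLIs
history = [0.9991, 0.9987, 0.9995, 0.9972, 0.9990]
print(suggest_slo(history))  # 0.9972
```

From there, tighten the target deliberately as reliability work lands, rather than declaring an aspirational number and starting out in breach.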
Putting It Into Practice
Start small: pick one or two critical user-facing services, define clear SLIs, agree on SLO targets with your product and business stakeholders, and start measuring. Build dashboards that show current SLO compliance and remaining error budget visually. Make the data visible in team standups and planning sessions.
The goal isn't perfect uptime — it's a shared understanding of what "reliable enough" means, and the ability to make rational trade-offs between speed and stability based on real data.
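For a dashboard like the one described above, two numbers usually matter: the current SLI and the fraction of the error budget already burned. A minimal request-based sketch, with illustrative counters standing in for whatever your metrics system reports:

```python
def slo_dashboard(good, total, target):
    """Current SLI and fraction of error budget consumed for a
    request-based SLO (budget_burned >= 1.0 means fully spent)."""
    sli = good / total
    budget = 1 - target            # allowed bad-request fraction
    bad_fraction = 1 - sli
    burned = bad_fraction / budget
    return {"sli": sli, "budget_burned": burned}

# 997,000 good requests out of 1,000,000 against a 99.5% target:
# SLI of 0.997, with 60% of the error budget consumed
print(slo_dashboard(good=997_000, total=1_000_000, target=0.995))
```

Surfacing `budget_burned` in standups makes the speed-versus-stability trade-off concrete: below some agreed threshold the team ships, above it reliability work takes over.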