SLA vs. SLO vs. SLI: What Every Developer Should Know
SLA, SLO, and SLI are often confused but serve different purposes. Learn the differences, how they relate, and how to implement them for your services.
Someone on your team says "we need an SLA." Someone else says "you mean an SLO." A third person asks "what is our SLI?" Everyone nods as if they understand, but nobody is sure they are talking about the same thing.
SLA, SLO, and SLI are three related but distinct concepts that every developer working on production services should understand. Getting them wrong leads to over-promising to customers, under-investing in reliability, or building monitoring that measures the wrong things. Here is a clear breakdown.
SLI: Service Level Indicator
What it is: A quantitative measurement of some aspect of the service you provide.
An SLI is a number. It is something you measure. It answers the question: "How is our service performing right now?"
Common SLIs include:
Availability: The percentage of requests that return a successful response (e.g., 99.95% of requests return 2xx status codes)
Latency: The time it takes to process a request (e.g., 95th percentile response time is 200ms)
Error rate: The percentage of requests that result in errors (e.g., 0.1% of requests return 5xx)
Throughput: The number of requests the service handles per unit of time (e.g., 10,000 requests per second)
The key to a good SLI is that it measures something your users actually care about. Internal CPU usage is a metric, but it is not an SLI because your users do not directly experience CPU usage. They experience slow page loads and error messages.
Best practice: Choose 3-5 SLIs that directly reflect user experience. More than that and you lose focus. Fewer than that and you miss important dimensions.
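To make the definitions above concrete, here is a minimal sketch of computing three of these SLIs from raw request records. The `Request` type, the field names, and the sample data are all hypothetical, and the percentile uses the simple nearest-rank method; a real monitoring system would likely compute these from aggregated metrics instead.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability(requests):
    """Percentage of requests that returned a 2xx status."""
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return 100.0 * ok / len(requests)


def error_rate(requests):
    """Percentage of requests that returned a 5xx status."""
    errors = sum(1 for r in requests if 500 <= r.status < 600)
    return 100.0 * errors / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: three successful requests and one server error.
sample = [Request(200, 120), Request(200, 180), Request(500, 900), Request(200, 150)]
print(availability(sample))            # 75.0
print(error_rate(sample))              # 25.0
print(latency_percentile(sample, 95))  # 900
```

Note that all three functions assume a non-empty list of requests; an empty window would need its own handling.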
SLO: Service Level Objective
What it is: A target value or range for an SLI.
An SLO is a goal. It answers the question: "How well should our service perform?"
If your SLI is availability (measured as percentage of successful requests), your SLO might be: "99.9% of requests will return successfully over a 30-day rolling window."
If your SLI is latency, your SLO might be: "95% of requests will complete in under 300ms over a 30-day rolling window."
SLOs are internal targets that your team sets and owns. They are not contractual obligations to customers (that is the SLA, which comes next). They are the engineering team's definition of "good enough."
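An SLO can be written down as data and checked against a measured SLI. The sketch below is one possible shape, not a standard API: the `SLO` class, its field names, and the request counts are hypothetical, and both objectives are evaluated over a 30-day window here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target_pct: float  # e.g. 99.9 means 99.9% of events must be "good"
    window_days: int


def meets_slo(slo, good_events, total_events):
    """True if the measured SLI over the window is at or above the target."""
    if total_events == 0:
        return True  # no traffic means nothing was violated (an assumption of this sketch)
    sli_pct = 100.0 * good_events / total_events
    return sli_pct >= slo.target_pct


availability_slo = SLO("availability", 99.9, 30)
latency_slo = SLO("requests under 300ms", 95.0, 30)

# Hypothetical counts for one 30-day window.
print(meets_slo(availability_slo, good_events=999_200, total_events=1_000_000))  # True
print(meets_slo(latency_slo, good_events=940_000, total_events=1_000_000))       # False
```

For the latency objective, a "good" event is a request that finished in under 300ms, so the same good-over-total ratio works for both SLOs.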
How to Set SLOs
Setting the right SLO requires balancing user expectations with engineering reality:
Too aggressive (99.99%): Your team spends all their time maintaining reliability and cannot ship features. Any minor issue becomes a fire drill.
Too lenient (99%): Users experience enough errors and slowness that they lose trust in your product.
Just right (99.9% for most SaaS): Allows for approximately 43 minutes of downtime per month. Users experience reliable service while the engineering team has room to ship and iterate.
The right SLO depends on your service and your users. A payment processing API might need 99.99%. A blog might be fine at 99.5%.
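The downtime each target allows is simple arithmetic, and seeing the numbers side by side helps when choosing a level. This short sketch assumes a 30-day window and treats the whole budget as full downtime; the function name is illustrative.

```python
def allowed_downtime_minutes(slo_pct, window_days=30):
    """Minutes of full downtime a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_pct) / 100.0


for target in (99.0, 99.5, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30 days")
```

Over 30 days this works out to roughly 432 minutes at 99%, 216 at 99.5%, 43.2 at 99.9%, and 4.3 at 99.99%, which is why each extra nine is so much harder to sustain.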
Error Budgets
The gap between your SLO and 100% is your error budget. If your SLO is 99.9% availability, your error budget is 0.1%, or roughly 43 minutes of downtime in a 30-day month.
As long as you are within your error budget, you can take risks: ship new features, run experiments, perform infrastructure migrations. If you exhaust your error budget, you slow down and focus on reliability.
This is one of the most powerful concepts in SRE. It turns reliability from a vague "we should be more careful" into a concrete, measurable budget that balances innovation with stability.
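An error budget can also be tracked in requests rather than minutes. The sketch below computes how much of the budget is left in a window and applies a deliberately simple, hypothetical policy; the function names and traffic numbers are made up for illustration, and real teams usually define more graduated responses.

```python
def error_budget_remaining(slo_pct, total_requests, failed_requests):
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (100.0 - slo_pct) / 100.0
    if allowed_failures == 0:
        return 0.0  # a 100% target (or no traffic) leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining):
    """Hypothetical policy: ship while budget remains, otherwise focus on reliability."""
    return "ship features" if remaining > 0 else "focus on reliability"


# Hypothetical window: 2,000,000 requests, 500 of them failed, SLO of 99.9%.
remaining = error_budget_remaining(99.9, total_requests=2_000_000, failed_requests=500)
print(f"{remaining:.0%} of the error budget left: {release_policy(remaining)}")
```

With a 99.9% SLO, 2,000,000 requests allow about 2,000 failures, so 500 failures leave about 75% of the budget.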
SLA: Service Level Agreement
What it is: A contractual commitment to your customers about the level of service you will provide, including consequences if you fail to meet it.
An SLA is a legal document. It answers the question: "What do we promise our customers, and what happens if we break that promise?"
An SLA typically includes:
The specific metrics and thresholds you commit to (based on your SLOs)
How those metrics are measured (based on your SLIs)
What happens when the SLA is breached (usually service credits or refunds)
Exclusions (scheduled maintenance, force majeure, customer-caused issues)
Critical rule: Your SLA should always be less aggressive than your SLO. If your internal SLO is 99.9%, your external SLA might be 99.5%. This buffer protects you from owing credits every time you have a minor incident.
SLA Breach Consequences
Most SaaS SLAs offer service credits when uptime falls below the committed threshold. For example, a credit schedule for a 99.9% uptime commitment might look like this:
| Monthly Uptime | Credit |
|---|---|
| 99.0% to below 99.9% | 10% credit |
| 95.0% to below 99.0% | 25% credit |
| Below 95.0% | 50% credit |
These are not just financial costs. An SLA breach triggers customer conversations, leadership reviews, and potentially legal action for enterprise contracts.
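The credit table above translates directly into a lookup. This sketch assumes the commitment is 99.9% uptime, treats each range's lower bound as inclusive, and applies the credit as a percentage of a hypothetical monthly fee; an actual SLA would spell out all of these details.

```python
def service_credit_pct(monthly_uptime_pct):
    """Credit owed, as a percentage of the monthly fee, for a given monthly uptime."""
    if monthly_uptime_pct >= 99.9:
        return 0   # commitment met, no credit owed
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50


def credit_amount(monthly_fee, monthly_uptime_pct):
    """Credit in currency units for a hypothetical monthly fee."""
    return monthly_fee * service_credit_pct(monthly_uptime_pct) / 100


print(credit_amount(2000, 99.95))  # 0.0
print(credit_amount(2000, 99.4))   # 200.0
print(credit_amount(2000, 94.0))   # 1000.0
```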
How They Work Together
The relationship flows like this:
1. You choose SLIs that measure what users care about (availability, latency, errors)
2. You set SLOs as internal targets for those SLIs (99.9% availability)
3. You commit to SLAs with customers based on your SLOs with a buffer (99.5% availability)
4. You monitor SLIs in real time to track whether you are meeting your SLOs
5. You use error budgets to balance feature development with reliability investment
When your monitoring shows your SLI approaching your SLO threshold, you know it is time to invest in reliability before you breach your SLA.
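The buffer between SLO and SLA can be expressed as a three-state check. The following is a minimal sketch using the example thresholds from this article (99.9% SLO, 99.5% SLA); the function name and status labels are hypothetical.

```python
def reliability_status(sli_pct, slo_pct=99.9, sla_pct=99.5):
    """Classify a measured SLI against the internal SLO and the external SLA."""
    if not sla_pct < slo_pct:
        raise ValueError("the SLA threshold should sit below the SLO")
    if sli_pct >= slo_pct:
        return "healthy"
    if sli_pct >= sla_pct:
        return "slo_missed"    # inside the buffer: time to invest in reliability
    return "sla_breached"      # customer-facing commitment broken


print(reliability_status(99.95))  # healthy
print(reliability_status(99.7))   # slo_missed
print(reliability_status(99.2))   # sla_breached
```

The middle state is the useful one: it signals trouble while there is still room to react before any customer commitment is broken.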
Implementing SLIs, SLOs, and SLAs in Practice
Step 1: Choose Your SLIs
Pick 3-5 metrics that reflect user experience. For a typical web application:
Availability (successful request ratio)
Latency (p50, p95, p99 response times)
Error rate (5xx responses)
Step 2: Set Internal SLOs
Start conservative and adjust based on data. If you do not know your current reliability, measure for 30 days before setting an SLO. Set your first SLO at or slightly below your actual observed performance so that it is achievable from day one, then tighten it as reliability improves.
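A starting point can be grounded in data by computing observed availability over the measurement window and checking which candidate targets it would have met. The sketch below uses made-up daily counts and a hypothetical list of candidate targets; it reports the comparison and leaves the final choice to the team.

```python
def observed_availability(daily_counts):
    """daily_counts: list of (good_requests, total_requests), one pair per day."""
    good = sum(g for g, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    return 100.0 * good / total


def candidate_report(observed_pct, candidates=(99.0, 99.5, 99.9, 99.99)):
    """Map each candidate target to whether the observed data would have met it."""
    return {c: observed_pct >= c for c in candidates}


# Hypothetical 30 days: 29 clean days and one day with an incident.
history = [(100_000, 100_000)] * 29 + [(96_000, 100_000)]
observed = observed_availability(history)
print(f"observed: {observed:.2f}%")
print(candidate_report(observed))
```

In this example a single bad day pulls the 30-day figure to about 99.87%, which already rules out a 99.9% target for that window.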
Step 3: Build Monitoring
You cannot manage what you do not measure. Set up monitoring that tracks your SLIs in real time and alerts you when you are consuming your error budget too quickly.
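One common way to detect that the budget is being consumed too quickly is a burn rate: the observed error rate divided by the error rate the SLO allows. This is a minimal sketch, not a description of any particular monitoring product; the alert threshold of 10 and the traffic numbers are hypothetical and should be tuned to your own windows.

```python
def burn_rate(slo_pct, failed, total):
    """Speed of error budget consumption; 1.0 means spending exactly on budget."""
    budget_fraction = (100.0 - slo_pct) / 100.0
    if budget_fraction <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    return (failed / total) / budget_fraction


def should_alert(slo_pct, failed, total, threshold=10.0):
    """Alert when the budget burns at least `threshold` times faster than allowed."""
    return burn_rate(slo_pct, failed, total) >= threshold


# Hypothetical last hour of traffic against a 99.9% SLO.
print(should_alert(99.9, failed=600, total=50_000))  # True: 1.2% errors, about 12x budget
print(should_alert(99.9, failed=20, total=50_000))   # False: 0.04% errors, under budget
```

Alerting on burn rate rather than on individual errors keeps brief blips quiet while still catching incidents that would exhaust the monthly budget within days.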
StatusShield provides exactly this: real-time availability monitoring, response time tracking, historical uptime data, and alerting when things degrade. Your status page shows customers your actual uptime, which builds trust and provides the data you need for SLA reporting.
Step 4: Publish SLAs (If Appropriate)
Not every product needs a formal SLA. For enterprise customers who require contractual commitments, publish an SLA with thresholds below your SLOs. For smaller products, a public status page showing historical uptime can serve a similar trust-building purpose without the legal overhead.
Start Measuring Today
You cannot set meaningful SLOs without data. You cannot honor SLAs without monitoring. And you cannot improve reliability without tracking SLIs over time.
Start monitoring with StatusShield and get the uptime data, response time metrics, and historical tracking you need to define and meet your service level commitments. The free plan includes 3 monitors and a public status page.