Acceptable Latency Thresholds
From RTT Numbers to Timer Budgeting and Failure Domains
“Acceptable latency” is not a single RTT value. It is the result of how much time each protocol step is allowed to consume before a timer expires.
This document reframes latency as transaction budgeting, mapped to real authentication and posture flows.
The objective is to clearly explain:
Where latency is actually consumed
How protocol timers interact and compound
What is acceptable by design
What is out of standard and likely to break services
Why increasing timers hides problems and increases risk
How topology choices directly affect success or failure
Latency thresholds only make sense when mapped to protocol timers and topology.
1. Latency Is a Budget, Not a Metric
Authentication is never a single hop.
A typical enterprise authentication is a composed transaction:
Endpoint ↔ NAD (802.1X, VPN, or WebAuth)
NAD ↔ ISE (RADIUS over UDP, EAP)
ISE ↔ DNS (SRV, A, PTR lookups)
ISE ↔ AD/DC (Kerberos and/or LDAP)
ISE ↔ PKI (CRL / OCSP, for certificate-based auth)
Endpoint ↔ ISE posture services (if posture is enabled)
The effective authentication time is the sum of:
RTT × protocol round trips: network round-trip time multiplied by the number of protocol exchanges (e.g., RADIUS, LDAP, Kerberos)
Backend response times: response latency from backend services such as AD, DNS, PKI, and posture services
Processing time: internal processing time for authentication, authorization, and policy evaluation
Jitter and queueing: variability caused by network jitter, congestion, and request queueing
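As a rough sketch, this composition can be expressed as a tiny model. Every name and number below is an assumption chosen for illustration, not a measured value:

```python
# Illustrative model of end-to-end authentication time.
# All values are assumptions for the example, not measurements.

def auth_time_ms(rtt_ms: int, round_trips: int, backend_ms: int,
                 processing_ms: int, jitter_ms: int) -> int:
    """Effective auth time = RTT x round trips + backend + processing + jitter."""
    return rtt_ms * round_trips + backend_ms + processing_ms + jitter_ms

# A "harmless" 40 ms RTT with 8 RADIUS/EAP exchanges already costs 320 ms
# before any backend work happens.
total = auth_time_ms(rtt_ms=40, round_trips=8, backend_ms=600,
                     processing_ms=300, jitter_ms=100)
print(f"estimated auth time: {total} ms")  # -> 1320 ms
```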
If any timer expires at any point in this chain, the system does not fail gracefully by default. Instead, it compensates through mechanisms that increase load, complexity, and instability:
Retries (RADIUS retransmissions, EAP re-initiations)
Queueing (authentication and posture backlogs)
Session duplication (parallel transactions for the same endpoint)
Fail-open or fallback behaviors (in some designs)
Posture “unknown” or compliance loops
These compensations are not independent — they reinforce each other and can quickly push the system into a degraded state.
2. The Practical Timer Stack
Authentication and posture are governed by multiple independent timers, each with its own tolerance window. Latency becomes dangerous when these windows overlap or expire out of order.
Each layer has a different concept of “patience”.
2.1 NAD Layer (RADIUS over UDP)
From the NAD’s perspective, the RADIUS timer is a hard boundary.
Defined by:
RADIUS timeout value
Number of retries
When the timer expires:
The request is retransmitted, or
The session fails
Important characteristics:
RADIUS is stateless over UDP
Retries are blind — the NAD does not know whether ISE is still processing
Each retry increases:
Concurrent sessions
Backend load
Probability of queueing
A single expired RADIUS timer often triggers multiple downstream effects, not just a retry.
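A minimal sketch of that amplification, assuming a 5-second timeout, 2 retries, and a backend that is slow rather than unreachable:

```python
# Sketch: blind RADIUS retries duplicate in-flight work when the server
# is slow, not down. All values are assumptions for illustration.

radius_timeout_s = 5       # NAD timeout (assumed)
retries = 2                # NAD retry count (assumed)
server_processing_s = 12   # backend is still working, just slowly

# The NAD retransmits each time its timer expires while the original
# request is still being processed, so duplicate sessions accumulate:
in_flight = 1 + min(retries, int(server_processing_s // radius_timeout_s))
print(f"concurrent copies of the same authentication: {in_flight}")  # -> 3
```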
2.2 EAP Layer
EAP is not a single exchange.
Most EAP methods require multiple request/response cycles
Each cycle consumes at least one RTT
Total EAP time grows linearly with latency
Implications:
Higher RTT multiplies total authentication time
Certificate-based EAP (EAP-TLS, TEAP) is especially sensitive
EAP delays can exist even when RADIUS timers have not yet expired
EAP transforms “moderate latency” into “significant total delay” through repetition.
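A quick illustration of that linear growth. The per-method cycle counts below are assumptions; real counts vary with certificate sizes, fragmentation, and session resumption:

```python
# Sketch: total EAP transport time grows linearly with RTT because each
# request/response cycle costs at least one round trip.

eap_cycles = {"PEAP (assumed ~6 cycles)": 6, "EAP-TLS (assumed ~10 cycles)": 10}

for rtt_ms in (20, 100):
    for method, cycles in eap_cycles.items():
        print(f"{method} at {rtt_ms} ms RTT: ~{cycles * rtt_ms} ms in transport alone")
```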
2.3 Identity Store Layer (AD / LDAP / Kerberos)
Identity resolution often dominates authentication time.
Typical operations include:
DNS SRV resolution for DC selection
Kerberos ticket validation
LDAP bind and group membership queries
Key risks:
Remote or mis-selected DCs dramatically increase latency
Kerberos is particularly sensitive to latency, because ticket acquisition and validation are sequential operations that each inherit the full RTT to the selected DC
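One practical check is to query the site-aware SRV record the policy node would use for DC selection and see where the answers point. A sketch assuming the dnspython package; the domain and site names are placeholders:

```python
# Sketch: resolve the site-aware SRV record used for DC selection and
# time the lookup. Domain/site names are placeholders.
import time
import dns.resolver  # pip install dnspython

site_aware = "_ldap._tcp.Default-First-Site-Name._sites.dc._msdcs.example.com"

start = time.monotonic()
answers = dns.resolver.resolve(site_aware, "SRV")
elapsed_ms = (time.monotonic() - start) * 1000

print(f"SRV resolution took {elapsed_ms:.0f} ms")
for rr in answers:
    # A remote DC here means every Kerberos/LDAP step inherits that RTT.
    print(f"priority={rr.priority} weight={rr.weight} target={rr.target}")
```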
3. A Latency Budgeting Model (Simple and Practical)
Assume a common NAD configuration:
RADIUS Timeout: 5 seconds
Retries: 2
From a strict protocol perspective, this might suggest a total tolerance of ~15 seconds. In practice, this interpretation is misleading.
A more accurate model is:
Single-attempt budget: ~5 seconds
Retries: safety mechanisms, not usable capacity
Retries do not extend the usable authentication window. They introduce bursts, duplication, and contention.
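A worked version of this distinction, using the configuration above:

```python
# The "on paper" tolerance vs. the usable budget (values from the text).
radius_timeout_s = 5
retries = 2

naive_tolerance = radius_timeout_s * (1 + retries)  # ~15 s, misleading
usable_budget = radius_timeout_s                    # first attempt only

print(f"naive tolerance: {naive_tolerance} s")
print(f"usable budget per attempt: {usable_budget} s")
# Retries absorb packet loss; they do not extend this 5 s window.
```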
Why retries are not free:
Each retry increases the number of concurrent authentication sessions
Each retry increases backend workload (DNS, Kerberos, LDAP, PKI)
Each retry increases queueing inside the policy engine
Under load, retries increase the probability of synchronized timeouts (“timeout storms”)
Retries are designed to absorb packet loss, not systemic latency.
3.1 Design Goal
Keep the 95th / 99th percentile end-to-end authentication time comfortably below the first RADIUS timeout.
If retries are being used regularly, the system is already operating outside its intended design envelope.
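A sketch of how the goal can be verified against exported authentication times. The sample values and the "comfortably below" threshold are placeholders:

```python
# Sketch: check p95/p99 auth latency against the first RADIUS timeout.
import statistics

# Placeholder samples; in practice, export these from authentication reports.
auth_times_ms = [850, 920, 1100, 1300, 4200, 980, 1050, 2100, 890, 1400]
radius_timeout_ms = 5000

cuts = statistics.quantiles(auth_times_ms, n=100)
p95, p99 = cuts[94], cuts[98]

# "Comfortably below" is interpreted here as < 60% of the first timeout.
ok = p99 < radius_timeout_ms * 0.6
print(f"p95={p95:.0f} ms, p99={p99:.0f} ms, inside design envelope: {ok}")
```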
4. What “Acceptable” Looks Like (Threshold Guidance)
The following thresholds are design starting points, not guarantees. They assume healthy infrastructure, correct site-aware configuration, and normal load.
| Path | Acceptable | Risky | Symptoms when exceeded |
| --- | --- | --- | --- |
| NAD ↔ ISE RTT (RADIUS) | < 50 ms | > 100 ms | RADIUS retries, auth latency alarms |
| ISE ↔ DNS resolver | < 20 ms | > 100 ms | “Random” auth slowness, DC mis-selection |
| ISE ↔ selected DC/KDC RTT | < 50–100 ms | > 150–200 ms | Kerberos delay/failure, LDAP slowness |
| DNS SRV resolution (site-aware) | Consistent, local | Remote answers | Unpredictable RTT, wrong DC choice |
| PKI revocation (CRL / OCSP) | Local or cached | Remote | EAP-TLS delays, cert instability |
| Posture module ↔ ISE services | < 150 ms | > 300 ms | Posture unknown, quarantine loops |
4.1 Why These Numbers Matter
Authentication and posture are multiplicative, not additive.
A single EAP method may require multiple RTTs
A single LDAP group lookup may involve multiple sequential queries
A single certificate revocation check can consume seconds if remote or slow
Latency that looks harmless in isolation can consume the entire timer budget when repeated across protocol steps.
Acceptable latency is defined by how much of the timer budget remains unused, not by raw RTT alone.
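A worked example of the compounding, with an 80 ms RTT and assumed step counts:

```python
# Sketch: the same 80 ms RTT consumed repeatedly across protocol steps.
rtt_ms = 80

eap_ms  = 8 * rtt_ms   # multiple EAP request/response cycles (assumed count)
ldap_ms = 3 * rtt_ms   # sequential group-membership queries (assumed count)
ocsp_ms = 400          # remote revocation check (assumed)

total = eap_ms + ldap_ms + ocsp_ms
print(f"{total} ms consumed of a 5000 ms budget, before processing time")
# -> 1280 ms: each step looked harmless in isolation.
```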
5. “Good” Topology (Acceptable Latency) — Mermaid
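A minimal sketch of such a topology, assuming a single site where every dependency the authentication chain touches (DNS, DC/KDC, PKI) is local to the policy node:

```mermaid
flowchart LR
    EP["Endpoint"]
    subgraph SITE["Local site"]
        NAD["NAD (switch / WLC / VPN)"]
        ISE["ISE PSN"]
        DNS["Site-aware DNS"]
        DC["Local AD DC / KDC"]
        PKI["Local CRL / OCSP"]
    end
    EP -->|"802.1X / EAP"| NAD
    NAD -->|"RADIUS (UDP)"| ISE
    ISE --> DNS
    ISE -->|"Kerberos / LDAP"| DC
    ISE -->|"revocation check"| PKI
    EP -.->|"posture"| ISE
```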
This topology is typically stable because it keeps policy and identity locality aligned. The result is predictable RTT, fewer retries, and healthy timer margins.
5.1 Why This Works — Metric-Driven View (Total Cost Perspective)
This topology works not only conceptually, but measurably.
From an operational perspective (Cisco ISE and similar policy engines), what matters is the total end-to-end authentication cost, not individual RTT values in isolation.
Practical Total Cost Targets (ISE-Oriented)
In a healthy, well-designed topology, typical observations are:
End-to-end authentication time (p95): < 2 seconds
End-to-end authentication time (p99): < 3 seconds
ISE “High Authentication Latency” alarms (> 500 ms):
Rare
Transient
Not correlated across large numbers of sessions
The commonly observed 500 ms “high latency” marker in ISE should be treated as a warning threshold, not a failure point.
How the Budget Is Consumed (Conceptual)
In a “good” topology, the budget is typically distributed as:
| Stage | Typical cost |
| --- | --- |
| NAD ↔ ISE (RADIUS/EAP RTT cycles) | 100–300 ms |
| DNS resolution (SRV, A, PTR) | < 50 ms |
| Kerberos validation | 100–300 ms |
| LDAP identity and group lookup | 200–500 ms |
| ISE processing and policy evaluation | 200–400 ms |
| Total (typical p95) | ~1–2 seconds |
This leaves significant margin before the first RADIUS timeout is approached.
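Summing the midpoints of the assumed ranges makes that margin explicit:

```python
# Midpoints of the assumed budget ranges above.
budget_ms = {
    "RADIUS/EAP RTT cycles": 200,
    "DNS resolution": 50,
    "Kerberos validation": 200,
    "LDAP lookups": 350,
    "ISE processing": 300,
}
total = sum(budget_ms.values())
print(f"typical total: {total} ms vs. a 5000 ms first timeout")  # -> 1100 ms
```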
Why 500 ms Matters — and Why It Is Not the Whole Story
ISE flags “high latency” when individual transactions exceed ~500 ms, but:
A single slow backend call does not necessarily break authentication
Multiple slow calls stack and compound
Repeated 500 ms events often indicate:
Remote DC selection
DNS indirection
PKI or posture latency
Early-stage queueing
Repeated “500 ms events” are an early signal that the system is burning its timer budget, even if authentications are still succeeding.
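A sketch of treating these events as a rate rather than as isolated alarms. The samples and the 20% alert threshold are assumptions:

```python
# Sketch: measure the share of authentications exceeding the 500 ms marker.
samples_ms = [320, 610, 540, 410, 980, 530, 700, 450, 620, 515]

share = sum(1 for s in samples_ms if s > 500) / len(samples_ms)

# Isolated spikes are noise; a sustained share of slow transactions means
# the timer budget is being burned on every authentication.
if share > 0.20:
    print(f"{share:.0%} of auths exceed 500 ms: check DC/DNS/PKI locality")
```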
Acceptable vs. Risky (Total Cost View)
From a total-cost perspective:
Acceptable / Healthy
Most authentications < 2 seconds
p99 comfortably < 3 seconds
RADIUS retries are rare
High-latency events are isolated
At Risk
Auth times frequently > 3 seconds
Regular high-latency alarms
Visible retries under moderate load
Posture entering “Unknown” intermittently
Unstable
Auth times competing with RADIUS timeout (5s)
Burst retries during peaks
Queue growth and CPU spikes
Posture loops or fallback behavior
Key Operational Insight
A topology is “good” not because nothing ever exceeds 500 ms, but because the total authentication cost consistently stays far away from protocol timers.
In well-designed environments, latency spikes remain noise. In poorly designed ones, they become the dominant execution path.
5.2 Interpreting High Authentication Latency in ISE (≈ 2800 ms Scenario)
In some deployments, operators may observe authentication latency values clustering around ~2800 ms (2.8 seconds) for individual authentication transactions.
This range represents a fundamentally different operating condition compared to the commonly referenced ~500 ms warning level.
At ~2800 ms, the system is no longer consuming margin — it is actively competing with protocol timers.
This value reflects an aggregated end-to-end authentication cost, typically including:
Multiple RADIUS and EAP exchanges
Identity store operations (DNS, Kerberos, LDAP)
Policy evaluation
Internal queueing delays
At this point, latency is no longer incidental; it is structural.
5.3 What ~2800 ms Really Means
In practical terms, an authentication latency around ~2800 ms implies:
A large portion of the first RADIUS timeout is already consumed
Any additional delay (jitter, backend slowdown, load spike) can push transactions into retry territory
Successful authentications may still occur, but timing safety margins are nearly exhausted
~2800 ms is not a warning. It is an indication that the system is operating inside the failure envelope.
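To make that concrete, a crude normal-distribution model (purely illustrative; real latency distributions are heavier-tailed) shows how fast retry probability grows once the mean sits near 2800 ms:

```python
# Illustrative only: model auth latency as Normal(mu, sigma) to show how
# thin the margin is. Sigma and the load-shifted mean are assumptions.
from statistics import NormalDist

timeout_ms = 5000

for mu in (2800, 3500):  # steady state vs. a modest slowdown under load
    p_retry = 1 - NormalDist(mu=mu, sigma=900).cdf(timeout_ms)
    print(f"mean {mu} ms -> P(attempt exceeds first timeout) = {p_retry:.1%}")
# ~0.7% at 2800 ms becomes ~4.8% at 3500 ms, and every such event is a
# blind retry that adds load exactly when the system is slowest.
```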
5.4 System Behavior at This Level
In environments where ~2800 ms latency is observed with regularity, common characteristics include:
RADIUS retries occurring intermittently
Authentication success dependent on load conditions
Increased variance between identical authentication attempts
Early signs of queue buildup
Posture increasingly entering “Unknown” or delayed states
From the user perspective, this may still appear as “working”, but from a system perspective, execution is already fragile.
5.5 Mapping ~2800 ms to Total Cost and Risk
| Indicator | Healthy | At Risk | Unstable |
| --- | --- | --- | --- |
| Typical auth latency | < 300 ms | 300–500 ms | ~2800 ms frequent |
| p95 auth latency | < 2 s | 2–3 s | ≥ 3 s |
| p99 auth latency | < 3 s | 3–5 s | Competes with RADIUS timeout |
| RADIUS retries | Rare | Periodic | Frequent |
| Queue growth | None | Transient | Sustained |
| Failure sensitivity | Low | Moderate | High |
At ~2800 ms, even minor disturbances can trigger:
Retry synchronization
Rapid queue expansion
CPU and memory pressure
Cascading authentication failures
5.6 Why This Condition Often Precedes Sudden Outages
A common pattern at this stage is:
Authentication appears mostly successful
Latency remains consistently high
Timers are increased to “stabilize” the system
Load increases or a backend dependency slows slightly
Retries align across sessions
Queues spike
Authentication collapses rapidly
Systems operating at ~2800 ms are often one small incident away from outage.
5.7 Correct Design Interpretation
From a design and architecture perspective:
~2800 ms is outside acceptable operating range
It indicates excessive dependency latency, queueing, or topology misalignment
The problem is no longer tuning — it is architecture and locality
A healthy target remains:
Most authentications < 2 seconds
p99 comfortably below the first RADIUS timeout
Retries as exceptions, not a dependency
A stable design keeps authentication latency far from timer boundaries, not merely below them.