Acceptable Latency Thresholds

From RTT Numbers to Timer Budgeting and Failure Domains

“Acceptable latency” is not a single RTT value. It is the result of how much time each protocol step is allowed to consume before a timer expires.

This document reframes latency as transaction budgeting, mapped to real authentication and posture flows.

The objective is to clearly explain:

  • Where latency is actually consumed

  • How protocol timers interact and compound

  • What is acceptable by design

  • What is out of standard and likely to break services

  • Why increasing timers hides problems and increases risk

  • How topology choices directly affect success or failure

Latency thresholds only make sense when mapped to protocol timers and topology.


1. Latency Is a Budget, Not a Metric

Authentication is never a single hop.

A typical enterprise authentication is a composed transaction:

  • Endpoint ↔ NAD (802.1X, VPN, or WebAuth)

  • NAD ↔ ISE (RADIUS over UDP, EAP)

  • ISE ↔ DNS (SRV, A, PTR lookups)

  • ISE ↔ AD/DC (Kerberos and/or LDAP)

  • ISE ↔ PKI (CRL / OCSP, for certificate-based auth)

  • Endpoint ↔ ISE posture services (if posture is enabled)

The effective authentication time is the sum of the following components:

| Component | Description |
| --- | --- |
| RTT × protocol round trips | Network round-trip time multiplied by the number of protocol exchanges (e.g., RADIUS, LDAP, Kerberos) |
| Backend response times | Response latency from backend services such as AD, DNS, PKI, and posture services |
| Processing time | Internal processing time for authentication, authorization, and policy evaluation |
| Jitter and queueing | Variability caused by network jitter, congestion, and request queueing |
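
As a rough illustration of this composition, a minimal Python sketch (all numbers are hypothetical, not measurements):

```python
# Hypothetical illustration of the latency-budget model above.
# All numbers are examples, not measurements.

def effective_auth_time_ms(rtt_ms: float, round_trips: int,
                           backend_ms: float, processing_ms: float,
                           jitter_ms: float) -> float:
    """Compose the four budget components into one end-to-end estimate."""
    return rtt_ms * round_trips + backend_ms + processing_ms + jitter_ms

# Example: 40 ms RTT, 12 protocol round trips (EAP + RADIUS),
# 600 ms of backend calls (DNS + Kerberos + LDAP), 300 ms of
# policy evaluation, 100 ms of jitter/queueing.
total = effective_auth_time_ms(40, 12, 600, 300, 100)
print(f"estimated end-to-end auth time: {total:.0f} ms")  # -> 1480 ms
```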

If any timer expires at any point in this chain, the system does not fail gracefully by default. Instead, it compensates through mechanisms that increase load, complexity, and instability:

  • Retries (RADIUS retransmissions, EAP re-initiations)

  • Queueing (authentication and posture backlogs)

  • Session duplication (parallel transactions for the same endpoint)

  • Fail-open or fallback behaviors (in some designs)

  • Posture “unknown” or compliance loops

These compensations are not independent — they reinforce each other and can quickly push the system into a degraded state.


2. The Practical Timer Stack

Authentication and posture are governed by multiple independent timers, each with its own tolerance window. Latency becomes dangerous when these windows overlap or expire out of order.

Each layer has a different concept of “patience”.


2.1 NAD Layer (RADIUS over UDP)

From the NAD’s perspective, the RADIUS timer is a hard boundary.

  • Defined by:

    • RADIUS timeout value

    • Number of retries

  • When the timer expires:

    • The request is retransmitted, or

    • The session fails

Important characteristics:

  • RADIUS is stateless over UDP

  • Retries are blind — the NAD does not know whether ISE is still processing

  • Each retry increases:

    • Concurrent sessions

    • Backend load

    • Probability of queueing

A single expired RADIUS timer often triggers multiple downstream effects, not just a retry.
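
To see how the timer settings translate into wall-clock behavior, a minimal sketch assuming a fixed 5-second timeout with 2 retries and no backoff (real NAD retry logic varies by platform):

```python
# Sketch of a NAD's blind retransmission schedule for one RADIUS request.
# Assumes a fixed per-attempt timeout with no backoff; real NADs vary.

TIMEOUT_S = 5.0   # per-attempt RADIUS timeout
RETRIES = 2       # retransmissions after the first attempt

def retry_schedule(timeout_s: float, retries: int) -> list[float]:
    """Instants (seconds after the first send) at which the NAD retransmits."""
    return [timeout_s * (i + 1) for i in range(retries)]

sends = retry_schedule(TIMEOUT_S, RETRIES)   # [5.0, 10.0]
give_up = TIMEOUT_S * (RETRIES + 1)          # 15.0 s
print(f"retransmissions at {sends} s; session declared failed at {give_up} s")
# Each retransmission can put another in-flight copy of the same request
# on the server side: load amplification, not extra usable budget.
```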


2.2 EAP Layer

EAP is not a single exchange.

  • Most EAP methods require multiple request/response cycles

  • Each cycle consumes at least one RTT

  • Total EAP time grows linearly with latency

Implications:

  • Higher RTT multiplies total authentication time

  • Certificate-based EAP (EAP-TLS, TEAP) is especially sensitive

  • EAP delays can exist even when RADIUS timers have not yet expired

EAP transforms “moderate latency” into “significant total delay” through repetition.
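
The linear growth is easy to quantify. A sketch with assumed round-trip counts per EAP method (actual counts depend on fragmentation, session resumption, and certificate chain size):

```python
# Illustrative only: round-trip counts per EAP method are assumptions and
# vary with fragmentation, session resumption, and certificate chain size.
EAP_ROUND_TRIPS = {"PEAP": 8, "EAP-TLS": 12, "TEAP": 14}

for rtt_ms in (10, 50, 100):
    for method, cycles in EAP_ROUND_TRIPS.items():
        print(f"RTT {rtt_ms:>3} ms  {method:<8} ~{rtt_ms * cycles:>5} ms total")
# At 10 ms RTT, EAP-TLS costs ~120 ms; at 100 ms RTT the same exchange
# costs ~1200 ms, roughly a quarter of a 5-second RADIUS budget.
```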


2.3 Identity Store Layer (AD / LDAP / Kerberos)

Identity resolution often dominates authentication time.

Typical operations include:

  • DNS SRV resolution for DC selection

  • Kerberos ticket validation

  • LDAP bind and group membership queries

Key risks:

  • Remote or mis-selected DCs dramatically increase latency

  • Kerberos is particularly sensitive to latency, because its ticket exchanges run sequentially and each adds at least one full round trip before authentication can complete
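
One practical way to check DC locality is to resolve the domain's LDAP SRV records and time a TCP connection to each advertised DC. A sketch using dnspython; the domain name is a placeholder:

```python
# Probe DC locality: resolve the LDAP SRV records, then time a TCP
# connect to each advertised DC. The domain below is a placeholder.
import socket
import time

import dns.resolver  # pip install dnspython

answers = dns.resolver.resolve("_ldap._tcp.dc._msdcs.example.com", "SRV")
for srv in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    host = str(srv.target).rstrip(".")
    start = time.perf_counter()
    try:
        socket.create_connection((host, srv.port), timeout=2).close()
        connect_ms = (time.perf_counter() - start) * 1000
        print(f"{host}:{srv.port}  connect {connect_ms:.1f} ms")
    except OSError as exc:
        print(f"{host}:{srv.port}  unreachable ({exc})")
```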


3. A Latency Budgeting Model (Simple and Practical)

Assume a common NAD configuration:

  • RADIUS Timeout: 5 seconds

  • Retries: 2

From a strict protocol perspective, this might suggest a total tolerance of ~15 seconds (three attempts × 5 seconds). In practice, this interpretation is misleading.

A more accurate model is:

  • Single-attempt budget: ~5 seconds

  • Retries: safety mechanisms, not usable capacity

Retries do not extend the usable authentication window. They introduce bursts, duplication, and contention.

Why retries are not free:

  • Each retry increases the number of concurrent authentication sessions

  • Each retry increases backend workload (DNS, Kerberos, LDAP, PKI)

  • Each retry increases queueing inside the policy engine

  • Under load, retries increase the probability of synchronized timeouts (“timeout storms”)

Retries are designed to absorb packet loss, not systemic latency.

3.1 Design Goal

Keep the 95th / 99th percentile end-to-end authentication time comfortably below the first RADIUS timeout.

If retries are being used regularly, the system is already operating outside its intended design envelope.
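
A minimal way to verify this goal against observed data, using Python's statistics module (the sample values and the 60% safety margin are illustrative):

```python
# Compare observed p95/p99 end-to-end auth times against the first
# RADIUS timeout. Sample values are invented for illustration.
import statistics

FIRST_TIMEOUT_MS = 5000
auth_times_ms = [850, 920, 1100, 1400, 1250, 980, 2100, 1750, 1300, 1600,
                 900, 1150, 1980, 1450, 1020, 2400, 1330, 1210, 880, 1500]

cuts = statistics.quantiles(auth_times_ms, n=100)  # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.0f} ms  p99={p99:.0f} ms  first timeout={FIRST_TIMEOUT_MS} ms")
if p99 > FIRST_TIMEOUT_MS * 0.6:  # illustrative safety margin, not a standard
    print("warning: tail latency is eating into the retry safety margin")
```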


4. What “Acceptable” Looks Like (Threshold Guidance)

The following thresholds are design starting points, not guarantees. They assume healthy infrastructure, correct site-aware configuration, and normal load.

| Segment / Dependency | Recommended (95th %) | Risk Zone | Typical Symptoms |
| --- | --- | --- | --- |
| NAD ↔ ISE RTT (RADIUS) | < 50 ms | > 100 ms | RADIUS retries, auth latency alarms |
| ISE ↔ DNS resolver | < 20 ms | > 100 ms | “Random” auth slowness, DC mis-selection |
| ISE ↔ selected DC/KDC RTT | < 50–100 ms | > 150–200 ms | Kerberos delay/failure, LDAP slowness |
| DNS SRV resolution (site-aware) | Consistent, local | Remote answers | Unpredictable RTT, wrong DC choice |
| PKI revocation (CRL / OCSP) | Local or cached | Remote | EAP-TLS delays, cert instability |
| Posture module ↔ ISE services | < 150 ms | > 300 ms | Posture unknown, quarantine loops |


4.1 Why These Numbers Matter

Authentication and posture are multiplicative, not additive.

  • A single EAP method may require multiple RTTs

  • A single LDAP group lookup may involve multiple sequential queries

  • A single certificate revocation check can consume seconds if remote or slow

Latency that looks harmless in isolation can consume the entire timer budget when repeated across protocol steps.

Acceptable latency is defined by how much of the timer budget remains unused, not by raw RTT alone.
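
A worked example of this compounding, with assumed step counts and backend costs (every number below is illustrative):

```python
# Worked example: a 'harmless' 40 ms RTT repeated across protocol steps.
# Step counts and backend costs are illustrative assumptions.
RTT_MS = 40
steps = {
    "EAP-TLS round trips (12 x RTT)": 12 * RTT_MS,
    "Sequential LDAP queries (4 x RTT + server time)": 4 * RTT_MS + 200,
    "Kerberos exchanges (3 x RTT + KDC time)": 3 * RTT_MS + 100,
    "Remote OCSP check": 600,
}
for name, cost_ms in steps.items():
    print(f"{name:<50} {cost_ms:>5} ms")
print(f"{'total consumed of a 5000 ms budget':<50} {sum(steps.values()):>5} ms")
# 480 + 360 + 220 + 600 = 1660 ms gone before ISE processing,
# jitter, or queueing are even counted.
```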


5. “Good” Topology (Acceptable Latency) — Mermaid

This topology is typically stable because it keeps policy and identity locality aligned. The result is predictable RTT, fewer retries, and healthy timer margins.

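A minimal Mermaid sketch of the topology described here (node and link labels are illustrative):

```mermaid
flowchart LR
    subgraph SiteA["Site A (all dependencies local)"]
        EP[Endpoint] -->|802.1X EAP| NAD[Switch / WLC]
        NAD -->|RADIUS over UDP| PSN[ISE PSN]
        PSN -->|SRV, A, PTR lookups| DNS[Local DNS resolver]
        PSN -->|Kerberos + LDAP| DC[Site-local DC / KDC]
        PSN -->|cached OCSP / CRL| PKI[Local OCSP responder]
    end
```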

5.1 Why This Works — Metric-Driven View (Total Cost Perspective)

This topology works not only conceptually, but measurably.

From an operational perspective (Cisco ISE and similar policy engines), what matters is the total end-to-end authentication cost, not individual RTT values in isolation.

Practical Total Cost Targets (ISE-Oriented)

In a healthy, well-designed topology, typical observations are:

  • End-to-end authentication time (p95): < 2 seconds

  • End-to-end authentication time (p99): < 3 seconds

  • ISE “High Authentication Latency” alarms (> 500 ms):

    • Rare

    • Transient

    • Not correlated across large numbers of sessions

The commonly observed 500 ms “high latency” marker in ISE should be treated as a warning threshold, not a failure point.


How the Budget Is Consumed (Conceptual)

In a “good” topology, the budget is typically distributed as:

| Component | Typical Contribution |
| --- | --- |
| NAD ↔ ISE (RADIUS/EAP RTT cycles) | 100–300 ms |
| DNS resolution (SRV, A, PTR) | < 50 ms |
| Kerberos validation | 100–300 ms |
| LDAP identity and group lookup | 200–500 ms |
| ISE processing and policy evaluation | 200–400 ms |
| Total (typical p95) | ~1–2 seconds |

This leaves significant margin before the first RADIUS timeout is approached.


Why 500 ms Matters — and Why It Is Not the Whole Story

ISE flags “high latency” when individual transactions exceed ~500 ms, but:

  • A single slow backend call does not necessarily break authentication

  • Multiple slow calls stack and compound

  • Repeated 500 ms events often indicate:

    • Remote DC selection

    • DNS indirection

    • PKI or posture latency

    • Early-stage queueing

Repeated “500 ms events” are an early signal that the system is burning its timer budget, even if authentications are still succeeding.


Acceptable vs. Risky (Total Cost View)

From a total-cost perspective:

  • Acceptable / Healthy

    • Most authentications < 2 seconds

    • p99 comfortably < 3 seconds

    • RADIUS retries are rare

    • High-latency events are isolated

  • At Risk

    • Auth times frequently > 3 seconds

    • Regular high-latency alarms

    • Visible retries under moderate load

    • Posture entering “Unknown” intermittently

  • Unstable

    • Auth times competing with RADIUS timeout (5s)

    • Burst retries during peaks

    • Queue growth and CPU spikes

    • Posture loops or fallback behavior
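
These bands can be expressed as a simple triage rule. A sketch whose thresholds mirror the bands above (input metrics come from whatever your monitoring exports):

```python
# Triage sketch mapping observed metrics onto the bands above.
# Thresholds mirror the text; inputs come from your monitoring system.
def classify(p95_s: float, p99_s: float, retries_common: bool,
             queues_growing: bool) -> str:
    if p99_s >= 5.0 or queues_growing:
        return "Unstable"        # competing with the 5 s RADIUS timeout
    if p95_s > 3.0 or retries_common:
        return "At Risk"
    if p95_s < 2.0 and p99_s < 3.0:
        return "Acceptable / Healthy"
    return "At Risk"             # gray zone: default to the cautious band

print(classify(p95_s=1.4, p99_s=2.6, retries_common=False,
               queues_growing=False))  # -> Acceptable / Healthy
```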


Key Operational Insight

A topology is “good” not because nothing ever exceeds 500 ms, but because the total authentication cost consistently stays far away from protocol timers.

In well-designed environments, latency spikes remain noise. In poorly designed ones, they become the dominant execution path.


5.2 Interpreting High Authentication Latency in ISE (≈ 2800 ms Scenario)

In some deployments, operators may observe authentication latency values clustering around ~2800 ms (2.8 seconds) for individual authentication transactions.

This range represents a fundamentally different operating condition compared to the commonly referenced ~500 ms warning level.

At ~2800 ms, the system is no longer consuming margin — it is actively competing with protocol timers.

This value reflects an aggregated end-to-end authentication cost, typically including:

  • Multiple RADIUS and EAP exchanges

  • Identity store operations (DNS, Kerberos, LDAP)

  • Policy evaluation

  • Internal queueing delays

At this point, latency is no longer incidental; it is structural.


5.3 What ~2800 ms Really Means

In practical terms, an authentication latency around ~2800 ms implies:

  • A large portion of the first RADIUS timeout is already consumed

  • Any additional delay (jitter, backend slowdown, load spike) can push transactions into retry territory

  • Successful authentications may still occur, but timing safety margins are nearly exhausted

~2800 ms is not a warning. It is an indication that the system is operating inside the failure envelope.
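
The arithmetic behind this statement is simple, assuming the common 5-second first timeout:

```python
# Remaining margin at ~2800 ms against a 5-second first RADIUS timeout.
FIRST_TIMEOUT_MS = 5000
observed_ms = 2800

margin_ms = FIRST_TIMEOUT_MS - observed_ms            # 2200 ms
consumed_pct = 100 * observed_ms / FIRST_TIMEOUT_MS   # 56%
print(f"margin {margin_ms} ms ({consumed_pct:.0f}% of the budget consumed)")
# A single slow backend call (e.g., a 2 s remote OCSP fetch) or a modest
# load spike is enough to cross the timeout and trigger a blind retry.
```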


5.4 System Behavior at This Level

In environments where ~2800 ms latency is observed with regularity, common characteristics include:

  • RADIUS retries occurring intermittently

  • Authentication success dependent on load conditions

  • Increased variance between identical authentication attempts

  • Early signs of queue buildup

  • Posture increasingly entering “Unknown” or delayed states

From the user perspective, the system may still appear to be working; from a system perspective, execution is already fragile.


5.5 Mapping ~2800 ms to Total Cost and Risk

| Metric | Acceptable | Degraded | Critical |
| --- | --- | --- | --- |
| Typical auth latency | < 300 ms | 300–500 ms | ~2800 ms frequent |
| p95 auth latency | < 2 s | 2–3 s | ≥ 3 s |
| p99 auth latency | < 3 s | 3–5 s | Competes with RADIUS timeout |
| RADIUS retries | Rare | Periodic | Frequent |
| Queue growth | None | Transient | Sustained |
| Failure sensitivity | Low | Moderate | High |

At ~2800 ms, even minor disturbances can trigger:

  • Retry synchronization

  • Rapid queue expansion

  • CPU and memory pressure

  • Cascading authentication failures


5.6 Why This Condition Often Precedes Sudden Outages

A common pattern at this stage is:

  • Authentication appears mostly successful

  • Latency remains consistently high

  • Timers are increased to “stabilize” the system

  • Load increases or a backend dependency slows slightly

  • Retries align across sessions

  • Queues spike

  • Authentication collapses rapidly

Systems operating at ~2800 ms are often one small incident away from outage.


5.7 Correct Design Interpretation

From a design and architecture perspective:

  • ~2800 ms is outside acceptable operating range

  • It indicates excessive dependency latency, queueing, or topology misalignment

  • The problem is no longer tuning — it is architecture and locality

A healthy target remains:

  • Most authentications < 2 seconds

  • p99 comfortably below the first RADIUS timeout

  • Retries as exceptions, not a dependency

A stable design keeps authentication latency far from timer boundaries, not merely below them.

