Acceptable Latency Thresholds

From RTT Numbers to Timer Budgeting and Failure Domains

“Acceptable latency” is not a single RTT value. It is the result of how much time each protocol step is allowed to consume before a timer expires.

This document reframes latency as transaction budgeting, mapped to real authentication and posture flows.

The objective is to clearly explain:

  • Where latency is actually consumed

  • How protocol timers interact and compound

  • What is acceptable by design

  • What is out of standard and likely to break services

  • Why increasing timers hides problems and increases risk

  • How topology choices directly affect success or failure

Latency thresholds only make sense when mapped to protocol timers and topology.


1. Latency Is a Budget, Not a Metric

Authentication is never a single hop.

A typical enterprise authentication is a composed transaction:

  • Endpoint ↔ NAD (802.1X, VPN, or WebAuth)

  • NAD ↔ ISE (RADIUS over UDP, EAP)

  • ISE ↔ DNS (SRV, A, PTR lookups)

  • ISE ↔ AD/DC (Kerberos and/or LDAP)

  • ISE ↔ PKI (CRL / OCSP, for certificate-based auth)

  • Endpoint ↔ ISE posture services (if posture is enabled)

The effective authentication time is the sum of the following components:

| Component | Description |
| --- | --- |
| RTT × protocol round trips | Network round-trip time multiplied by the number of protocol exchanges (e.g., RADIUS, LDAP, Kerberos) |
| Backend response times | Response latency from backend services such as AD, DNS, PKI, and posture services |
| Processing time | Internal processing time for authentication, authorization, and policy evaluation |
| Jitter and queueing | Variability caused by network jitter, congestion, and request queueing |
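
As a rough illustration of this composition, a minimal Python sketch (all numbers are hypothetical, not measurements):

```python
# Hypothetical illustration of the latency-budget model above.
# All numbers are examples, not measurements.

def effective_auth_time_ms(rtt_ms: float, round_trips: int,
                           backend_ms: float, processing_ms: float,
                           jitter_ms: float) -> float:
    """Compose the four budget components into one end-to-end estimate."""
    return rtt_ms * round_trips + backend_ms + processing_ms + jitter_ms

# Example: 40 ms RTT, 12 protocol round trips (EAP + RADIUS),
# 600 ms of backend calls (DNS + Kerberos + LDAP), 300 ms of
# policy evaluation, 100 ms of jitter/queueing.
total = effective_auth_time_ms(40, 12, 600, 300, 100)
print(f"estimated end-to-end auth time: {total:.0f} ms")  # -> 1480 ms
```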

If any timer expires at any point in this chain, the system does not fail gracefully by default. Instead, it compensates through mechanisms that increase load, complexity, and instability:

  • Retries (RADIUS retransmissions, EAP re-initiations)

  • Queueing (authentication and posture backlogs)

  • Session duplication (parallel transactions for the same endpoint)

  • Fail-open or fallback behaviors (in some designs)

  • Posture “unknown” or compliance loops

These compensations are not independent — they reinforce each other and can quickly push the system into a degraded state.


2. The Practical Timer Stack

Authentication and posture are governed by multiple independent timers, each with its own tolerance window. Latency becomes dangerous when these windows overlap or expire out of order.

Each layer has a different concept of “patience”.


2.1 NAD Layer (RADIUS over UDP)

From the NAD’s perspective, the RADIUS timer is a hard boundary.

  • Defined by:

    • RADIUS timeout value

    • Number of retries

  • When the timer expires:

    • The request is retransmitted, or

    • The session fails

Important characteristics:

  • RADIUS is stateless over UDP

  • Retries are blind — the NAD does not know whether ISE is still processing

  • Each retry increases:

    • Concurrent sessions

    • Backend load

    • Probability of queueing

A single expired RADIUS timer often triggers multiple downstream effects, not just a retry.
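
To see how the timer settings translate into wall-clock behavior, a minimal sketch assuming a fixed 5-second timeout with 2 retries and no backoff (real NAD retry logic varies by platform):

```python
# Sketch of a NAD's blind retransmission schedule for one RADIUS request.
# Assumes a fixed per-attempt timeout with no backoff; real NADs vary.

TIMEOUT_S = 5.0   # per-attempt RADIUS timeout
RETRIES = 2       # retransmissions after the first attempt

def retry_schedule(timeout_s: float, retries: int) -> list[float]:
    """Instants (seconds after the first send) at which the NAD retransmits."""
    return [timeout_s * (i + 1) for i in range(retries)]

sends = retry_schedule(TIMEOUT_S, RETRIES)   # [5.0, 10.0]
give_up = TIMEOUT_S * (RETRIES + 1)          # 15.0 s
print(f"retransmissions at {sends} s; session declared failed at {give_up} s")
# Each retransmission can put another in-flight copy of the same request
# on the server side: load amplification, not extra usable budget.
```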


2.2 EAP Layer

EAP is not a single exchange.

  • Most EAP methods require multiple request/response cycles

  • Each cycle consumes at least one RTT

  • Total EAP time grows linearly with latency

Implications:

  • Higher RTT multiplies total authentication time

  • Certificate-based EAP (EAP-TLS, TEAP) is especially sensitive

  • EAP delays can exist even when RADIUS timers have not yet expired

EAP transforms “moderate latency” into “significant total delay” through repetition.
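
The linear growth is easy to quantify. A sketch with assumed round-trip counts per EAP method (actual counts depend on fragmentation, session resumption, and certificate chain size):

```python
# Illustrative only: round-trip counts per EAP method are assumptions and
# vary with fragmentation, session resumption, and certificate chain size.
EAP_ROUND_TRIPS = {"PEAP": 8, "EAP-TLS": 12, "TEAP": 14}

for rtt_ms in (10, 50, 100):
    for method, cycles in EAP_ROUND_TRIPS.items():
        print(f"RTT {rtt_ms:>3} ms  {method:<8} ~{rtt_ms * cycles:>5} ms total")
# At 10 ms RTT, EAP-TLS costs ~120 ms; at 100 ms RTT the same exchange
# costs ~1200 ms, roughly a quarter of a 5-second RADIUS budget.
```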


2.3 Identity Store Layer (AD / LDAP / Kerberos)

Identity resolution often dominates authentication time.

Typical operations include:

  • DNS SRV resolution for DC selection

  • Kerberos ticket validation

  • LDAP bind and group membership queries

Key risks:

  • Remote or mis-selected DCs dramatically increase latency

  • Kerberos is particularly sensitive to latency, because its ticket exchanges run sequentially and each adds at least one full round trip before authentication can complete
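
One practical way to check DC locality is to resolve the domain's LDAP SRV records and time a TCP connection to each advertised DC. A sketch using dnspython; the domain name is a placeholder:

```python
# Probe DC locality: resolve the LDAP SRV records, then time a TCP
# connect to each advertised DC. The domain below is a placeholder.
import socket
import time

import dns.resolver  # pip install dnspython

answers = dns.resolver.resolve("_ldap._tcp.dc._msdcs.example.com", "SRV")
for srv in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    host = str(srv.target).rstrip(".")
    start = time.perf_counter()
    try:
        socket.create_connection((host, srv.port), timeout=2).close()
        connect_ms = (time.perf_counter() - start) * 1000
        print(f"{host}:{srv.port}  connect {connect_ms:.1f} ms")
    except OSError as exc:
        print(f"{host}:{srv.port}  unreachable ({exc})")
```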


3. A Latency Budgeting Model (Simple and Practical)

Assume a common NAD configuration:

  • RADIUS Timeout: 5 seconds

  • Retries: 2

From a strict protocol perspective, this might suggest a total tolerance of ~15 seconds (three attempts × 5 seconds). In practice, this interpretation is misleading.

A more accurate model is:

  • Single-attempt budget: ~5 seconds

  • Retries: safety mechanisms, not usable capacity

Retries do not extend the usable authentication window. They introduce bursts, duplication, and contention.

Why retries are not free:

  • Each retry increases the number of concurrent authentication sessions

  • Each retry increases backend workload (DNS, Kerberos, LDAP, PKI)

  • Each retry increases queueing inside the policy engine

  • Under load, retries increase the probability of synchronized timeouts (“timeout storms”)

Retries are designed to absorb packet loss, not systemic latency.

3.1 Design Goal

Keep the 95th / 99th percentile end-to-end authentication time comfortably below the first RADIUS timeout.

If retries are being used regularly, the system is already operating outside its intended design envelope.
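
A minimal way to verify this goal against observed data, using Python's statistics module (the sample values and the 60% safety margin are illustrative):

```python
# Compare observed p95/p99 end-to-end auth times against the first
# RADIUS timeout. Sample values are invented for illustration.
import statistics

FIRST_TIMEOUT_MS = 5000
auth_times_ms = [850, 920, 1100, 1400, 1250, 980, 2100, 1750, 1300, 1600,
                 900, 1150, 1980, 1450, 1020, 2400, 1330, 1210, 880, 1500]

cuts = statistics.quantiles(auth_times_ms, n=100)  # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.0f} ms  p99={p99:.0f} ms  first timeout={FIRST_TIMEOUT_MS} ms")
if p99 > FIRST_TIMEOUT_MS * 0.6:  # illustrative safety margin, not a standard
    print("warning: tail latency is eating into the retry safety margin")
```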


4. What “Acceptable” Looks Like (Threshold Guidance)

The following thresholds are design starting points, not guarantees. They assume healthy infrastructure, correct site-aware configuration, and normal load.

| Segment / Dependency | Recommended (95th %) | Risk Zone | Typical Symptoms |
| --- | --- | --- | --- |
| NAD ↔ ISE RTT (RADIUS) | < 50 ms | > 100 ms | RADIUS retries, auth latency alarms |
| ISE ↔ DNS resolver | < 20 ms | > 100 ms | “Random” auth slowness, DC mis-selection |
| ISE ↔ selected DC/KDC RTT | < 50–100 ms | > 150–200 ms | Kerberos delay/failure, LDAP slowness |
| DNS SRV resolution (site-aware) | Consistent, local | Remote answers | Unpredictable RTT, wrong DC choice |
| PKI revocation (CRL / OCSP) | Local or cached | Remote | EAP-TLS delays, cert instability |
| Posture module ↔ ISE services | < 150 ms | > 300 ms | Posture unknown, quarantine loops |


4.1 Why These Numbers Matter

Authentication and posture are multiplicative, not additive.

  • A single EAP method may require multiple RTTs

  • A single LDAP group lookup may involve multiple sequential queries

  • A single certificate revocation check can consume seconds if remote or slow

Latency that looks harmless in isolation can consume the entire timer budget when repeated across protocol steps.

Acceptable latency is defined by how much of the timer budget remains unused, not by raw RTT alone.
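
A worked example of this compounding, with assumed step counts and backend costs (every number below is illustrative):

```python
# Worked example: a 'harmless' 40 ms RTT repeated across protocol steps.
# Step counts and backend costs are illustrative assumptions.
RTT_MS = 40
steps = {
    "EAP-TLS round trips (12 x RTT)": 12 * RTT_MS,
    "Sequential LDAP queries (4 x RTT + server time)": 4 * RTT_MS + 200,
    "Kerberos exchanges (3 x RTT + KDC time)": 3 * RTT_MS + 100,
    "Remote OCSP check": 600,
}
for name, cost_ms in steps.items():
    print(f"{name:<50} {cost_ms:>5} ms")
print(f"{'total consumed of a 5000 ms budget':<50} {sum(steps.values()):>5} ms")
# 480 + 360 + 220 + 600 = 1660 ms gone before ISE processing,
# jitter, or queueing are even counted.
```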


5. “Good” Topology (Acceptable Latency) — Mermaid

This topology is typically stable because it keeps policy and identity locality aligned. The result is predictable RTT, fewer retries, and healthy timer margins.

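A minimal Mermaid sketch of the topology described here (node and link labels are illustrative):

```mermaid
flowchart LR
    subgraph SiteA["Site A (all dependencies local)"]
        EP[Endpoint] -->|802.1X EAP| NAD[Switch / WLC]
        NAD -->|RADIUS over UDP| PSN[ISE PSN]
        PSN -->|SRV, A, PTR lookups| DNS[Local DNS resolver]
        PSN -->|Kerberos + LDAP| DC[Site-local DC / KDC]
        PSN -->|cached OCSP / CRL| PKI[Local OCSP responder]
    end
```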

5.1 Why This Works — Metric-Driven View (Total Cost Perspective)

This topology works not only conceptually, but measurably.

From an operational perspective (Cisco ISE and similar policy engines), what matters is the total end-to-end authentication cost, not individual RTT values in isolation.

Practical Total Cost Targets (ISE-Oriented)

In a healthy, well-designed topology, typical observations are:

  • End-to-end authentication time (p95): < 2 seconds

  • End-to-end authentication time (p99): < 3 seconds

  • ISE “High Authentication Latency” alarms (> 500 ms):

    • Rare

    • Transient

    • Not correlated across large numbers of sessions

The commonly observed 500 ms “high latency” marker in ISE should be treated as a warning threshold, not a failure point.


How the Budget Is Consumed (Conceptual)

In a “good” topology, the budget is typically distributed as:

| Component | Typical Contribution |
| --- | --- |
| NAD ↔ ISE (RADIUS/EAP RTT cycles) | 100–300 ms |
| DNS resolution (SRV, A, PTR) | < 50 ms |
| Kerberos validation | 100–300 ms |
| LDAP identity and group lookup | 200–500 ms |
| ISE processing and policy evaluation | 200–400 ms |
| Total (typical p95) | ~1–2 seconds |

This leaves significant margin before the first RADIUS timeout is approached.


Why 500 ms Matters — and Why It Is Not the Whole Story

ISE flags “high latency” when individual transactions exceed ~500 ms, but:

  • A single slow backend call does not necessarily break authentication

  • Multiple slow calls stack and compound

  • Repeated 500 ms events often indicate:

    • Remote DC selection

    • DNS indirection

    • PKI or posture latency

    • Early-stage queueing

Repeated “500 ms events” are an early signal that the system is burning its timer budget, even if authentications are still succeeding.


Acceptable vs. Risky (Total Cost View)

From a total-cost perspective:

  • Acceptable / Healthy

    • Most authentications < 2 seconds

    • p99 comfortably < 3 seconds

    • RADIUS retries are rare

    • High-latency events are isolated

  • At Risk

    • Auth times frequently > 3 seconds

    • Regular high-latency alarms

    • Visible retries under moderate load

    • Posture entering “Unknown” intermittently

  • Unstable

    • Auth times competing with RADIUS timeout (5s)

    • Burst retries during peaks

    • Queue growth and CPU spikes

    • Posture loops or fallback behavior
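
These bands can be expressed as a simple triage rule. A sketch whose thresholds mirror the bands above (input metrics come from whatever your monitoring exports):

```python
# Triage sketch mapping observed metrics onto the bands above.
# Thresholds mirror the text; inputs come from your monitoring system.
def classify(p95_s: float, p99_s: float, retries_common: bool,
             queues_growing: bool) -> str:
    if p99_s >= 5.0 or queues_growing:
        return "Unstable"        # competing with the 5 s RADIUS timeout
    if p95_s > 3.0 or retries_common:
        return "At Risk"
    if p95_s < 2.0 and p99_s < 3.0:
        return "Acceptable / Healthy"
    return "At Risk"             # gray zone: default to the cautious band

print(classify(p95_s=1.4, p99_s=2.6, retries_common=False,
               queues_growing=False))  # -> Acceptable / Healthy
```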


Key Operational Insight

A topology is “good” not because nothing ever exceeds 500 ms, but because the total authentication cost consistently stays far away from protocol timers.

In well-designed environments, latency spikes remain noise. In poorly designed ones, they become the dominant execution path.


5.2 Interpreting High Authentication Latency in ISE (≈ 2800 ms Scenario)

In some deployments, operators may observe authentication latency values clustering around ~2800 ms (2.8 seconds) for individual authentication transactions.

This range represents a fundamentally different operating condition compared to the commonly referenced ~500 ms warning level.

At ~2800 ms, the system is no longer consuming margin — it is actively competing with protocol timers.

This value reflects an aggregated end-to-end authentication cost, typically including:

  • Multiple RADIUS and EAP exchanges

  • Identity store operations (DNS, Kerberos, LDAP)

  • Policy evaluation

  • Internal queueing delays

At this point, latency is no longer incidental; it is structural.


5.3 What ~2800 ms Really Means

In practical terms, an authentication latency around ~2800 ms implies:

  • A large portion of the first RADIUS timeout is already consumed

  • Any additional delay (jitter, backend slowdown, load spike) can push transactions into retry territory

  • Successful authentications may still occur, but timing safety margins are nearly exhausted

~2800 ms is not a warning. It is an indication that the system is operating inside the failure envelope.
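
The arithmetic behind this statement is simple, assuming the common 5-second first timeout:

```python
# Remaining margin at ~2800 ms against a 5-second first RADIUS timeout.
FIRST_TIMEOUT_MS = 5000
observed_ms = 2800

margin_ms = FIRST_TIMEOUT_MS - observed_ms            # 2200 ms
consumed_pct = 100 * observed_ms / FIRST_TIMEOUT_MS   # 56%
print(f"margin {margin_ms} ms ({consumed_pct:.0f}% of the budget consumed)")
# A single slow backend call (e.g., a 2 s remote OCSP fetch) or a modest
# load spike is enough to cross the timeout and trigger a blind retry.
```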


5.4 System Behavior at This Level

In environments where ~2800 ms latency is observed with regularity, common characteristics include:

  • RADIUS retries occurring intermittently

  • Authentication success dependent on load conditions

  • Increased variance between identical authentication attempts

  • Early signs of queue buildup

  • Posture increasingly entering “Unknown” or delayed states

From the user perspective, the system may still appear to be working; from a system perspective, execution is already fragile.


5.5 Mapping ~2800 ms to Total Cost and Risk

| Metric | Acceptable | Degraded | Critical |
| --- | --- | --- | --- |
| Typical auth latency | < 300 ms | 300–500 ms | ~2800 ms frequent |
| p95 auth latency | < 2 s | 2–3 s | ≥ 3 s |
| p99 auth latency | < 3 s | 3–5 s | Competes with RADIUS timeout |
| RADIUS retries | Rare | Periodic | Frequent |
| Queue growth | None | Transient | Sustained |
| Failure sensitivity | Low | Moderate | High |

At ~2800 ms, even minor disturbances can trigger:

  • Retry synchronization

  • Rapid queue expansion

  • CPU and memory pressure

  • Cascading authentication failures


5.6 Why This Condition Often Precedes Sudden Outages

A common pattern at this stage is:

  • Authentication appears mostly successful

  • Latency remains consistently high

  • Timers are increased to “stabilize” the system

  • Load increases or a backend dependency slows slightly

  • Retries align across sessions

  • Queues spike

  • Authentication collapses rapidly

Systems operating at ~2800 ms are often one small incident away from outage.


5.7 Correct Design Interpretation

From a design and architecture perspective:

  • ~2800 ms is outside acceptable operating range

  • It indicates excessive dependency latency, queueing, or topology misalignment

  • The problem is no longer tuning — it is architecture and locality

A healthy target remains:

  • Most authentications < 2 seconds

  • p99 comfortably below the first RADIUS timeout

  • Retries as exceptions, not a dependency

A stable design keeps authentication latency far from timer boundaries, not merely below them.

