CPU time and internal contention (thread queues, locks, GC, disk)
Retries and retransmission storms (NADs, supplicants, AAA clients)
When this combined latency exceeds NAD/supplicant timers, the architecture enters very specific failure modes.
1. What Actually Queues Inside ISE (and Why)
Cisco Live BRKSEC-3412 presents a practical view:
latency appears in Live Logs as Step Latency (e.g. Evaluating Policy Group), making it possible to separate where the time is actually being spent. A step that dominates the total is a clear signature of internal contention or indirect dependencies (DB, cache, DNS, internal lookups).
What breaks:
Authentication fails intermittently rather than consistently
Throughput collapses due to growing queues
During peaks, NADs treat ISE as dead
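As a rough illustration of this approach, the sketch below scans an exported authentication report and flags sessions where a single step dominates the latency. The CSV layout and column names (session_id, step, latency_ms) are assumptions for illustration, not the actual ISE export schema.

```python
# Sketch: flag authentication sessions where a single step dominates latency.
# Assumes a hypothetical CSV export with columns: session_id, step, latency_ms.
import csv
from collections import defaultdict

THRESHOLD_MS = 5000           # any single step above this is suspicious
sessions = defaultdict(list)  # session_id -> list of (step, latency_ms)

with open("ise_auth_steps.csv", newline="") as f:
    for row in csv.DictReader(f):
        sessions[row["session_id"]].append((row["step"], float(row["latency_ms"])))

for sid, steps in sessions.items():
    total = sum(ms for _, ms in steps)
    worst_step, worst_ms = max(steps, key=lambda s: s[1])
    if worst_ms > THRESHOLD_MS:
        share = 100 * worst_ms / total if total else 0
        print(f"{sid}: '{worst_step}' took {worst_ms:.0f} ms "
              f"({share:.0f}% of {total:.0f} ms total)")
```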
2.2 ISE “Under Capacity” but >10s Auth Latency Due to DNS / Logging
Case (ISE 3.1 P1):
Authentication latency > 10 seconds during peak
PSN not at session limit
WLCs mark ISE as dead (RADIUS timeout)
Failover to secondary PSN
Root cause:
Remote Logging Target configured via FQDN
DNS resolution impacting internal queues
Switching the logging target to an IPv4 address → issue resolved immediately
Architectural lesson:
latency can originate from non-obvious components (DNS / logging) and surface as AAA pipeline queueing.
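A minimal way to sanity-check whether name resolution of a remote logging target could be adding delay is to time repeated lookups from the PSN's point of view. The hostname below is a placeholder; this sketch only measures resolver latency, not ISE's internal behavior.

```python
# Sketch: time repeated DNS lookups of a (hypothetical) remote logging target FQDN.
# Slow or erratic resolution here is a hint that FQDN-based targets can stall queues.
import socket
import time

TARGET_FQDN = "syslog-collector.example.com"  # placeholder hostname

samples = []
for _ in range(10):
    start = time.monotonic()
    try:
        socket.getaddrinfo(TARGET_FQDN, 514)
        elapsed_ms = (time.monotonic() - start) * 1000
        samples.append(elapsed_ms)
        print(f"resolved in {elapsed_ms:.1f} ms")
    except socket.gaierror as exc:
        print(f"resolution failed: {exc}")
    time.sleep(1)

if samples:
    print(f"min/avg/max: {min(samples):.1f} / "
          f"{sum(samples)/len(samples):.1f} / {max(samples):.1f} ms")
```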
2.3 External Latency (AD / DC) → Extreme Step Latency
Another report shows:
Step latency > 60,000 ms
RPC communication errors with Domain Controllers
Failover threshold exceeded
Typical pattern of:
Unstable external dependency
DC switching / RPC failures
AAA SLA collapse
Policy fallback or rejection behavior
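To get a rough external view of Domain Controller health while step latency is exploding, a simple TCP connect probe against the usual AD ports can be run alongside the ISE logs. Hostnames and ports here are illustrative; a real check would also cover RPC, Kerberos, and LDAP bind times.

```python
# Sketch: measure TCP connect latency to Domain Controllers on common AD ports.
# High or erratic connect times correlate with the RPC / failover symptoms above.
import socket
import time

DCS = ["dc1.example.local", "dc2.example.local"]   # placeholder DC names
PORTS = {88: "kerberos", 389: "ldap", 445: "smb/rpc"}

for dc in DCS:
    for port, name in PORTS.items():
        start = time.monotonic()
        try:
            with socket.create_connection((dc, port), timeout=5):
                ms = (time.monotonic() - start) * 1000
                print(f"{dc}:{port} ({name}) connected in {ms:.1f} ms")
        except OSError as exc:
            print(f"{dc}:{port} ({name}) FAILED: {exc}")
```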
3. How This Becomes an “Auth Storm”
3.1 Amplifiers (Exponential Effects)
Aggressive RADIUS timeout on NADs (e.g. 1s)
Fast retransmissions
Boot storms / shift changes / power recovery
CoA or mass reauthentication
One slow PSN pushes its load onto peer PSNs → cluster-wide contamination
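The amplification is easy to quantify: once the PSN's response time crosses the NAD timeout, every outstanding request is retransmitted, so the offered RADIUS load multiplies even though the number of real clients has not changed. The toy model below makes this concrete; all numbers are illustrative, not vendor defaults.

```python
# Toy model: how retransmissions multiply offered RADIUS load once the PSN's
# response time exceeds the NAD timeout. All numbers are illustrative only.

def offered_load(base_rps: float, response_ms: float,
                 timeout_ms: float, retries: int) -> float:
    """Requests/second actually hitting the PSN, including retransmissions."""
    if response_ms <= timeout_ms:
        return base_rps                      # no retransmissions triggered
    # Each original request is re-sent once per expired timeout, up to 'retries'.
    resends = min(retries, int(response_ms // timeout_ms))
    return base_rps * (1 + resends)

for response_ms in (800, 1500, 3000, 6000, 12000):
    load = offered_load(base_rps=200, response_ms=response_ms,
                        timeout_ms=1000, retries=3)
    print(f"response {response_ms:>5} ms -> offered load {load:.0f} req/s")
```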
A recurring point in the community:
the RADIUS timeout must cover RTT NAD↔ISE + RTT ISE↔Identity Store + policy evaluation + extra checks.
Aggressively short timeouts cause failures even when ISE shows no explicit errors.
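A back-of-the-envelope way to apply that rule is to sum the measured components and compare them against the configured NAD timeout. The component values below are placeholders, not measurements from any real deployment; the point is simply that a 1-second timeout often fails to cover the full chain.

```python
# Sketch: check whether a configured NAD RADIUS timeout covers the full AAA chain.
# All component latencies are placeholder values in milliseconds.

budget = {
    "rtt_nad_to_ise_ms": 20,
    "rtt_ise_to_identity_store_ms": 250,
    "policy_evaluation_ms": 450,
    "extra_checks_ms": 400,   # posture, profiling lookups, external MDM, etc.
}

configured_timeout_ms = 1000   # e.g. an "aggressive" 1-second NAD timeout
required_ms = sum(budget.values())
headroom = configured_timeout_ms - required_ms

print(f"required  : {required_ms} ms")
print(f"configured: {configured_timeout_ms} ms")
print("OK" if headroom > 0 else f"UNDERSIZED by {-headroom} ms -> retransmissions likely")
```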