
Failure Modes: Authentication Queues (ISE / AAA Under Latency)

In NAC systems (e.g. Cisco ISE), latency is rarely a single number. What breaks user experience and/or security is the combined effect of:

  • RTT NAD ↔ PSN (RADIUS / EAP)

  • RTT PSN ↔ Identity Stores (AD / LDAP / IdP / MFA / OCSP / CRL / MDM)

  • CPU time and internal contention (thread queues, locks, GC, disk)

  • Retries and retransmission storms (NADs, supplicants, AAA clients)

When this combined latency exceeds NAD/supplicant timers, the architecture enters very specific failure modes.


1. What Actually Queues Inside ISE (and Why)

Cisco Live BRKSEC-3412 presents a practical view: latency appears in Live Logs as Step Latency (e.g. Evaluating Policy Group), allowing clear separation between:

  • External latency (identity stores, network, MFA, OCSP, DNS)

  • Internal latency (processing and queue contention)

1.1 Thread Pools / Queues That Become Bottlenecks (TAC View)

The RADIUS / policy pipeline uses multiple thread pools, for example:

  • Main (RADIUS I/O)

  • Policy (policy evaluation / PDP)

  • EapTls

  • ADIDStore

  • RestFetcher

The classic failure pattern under latency:

  1. Main receives Access-Request and creates a session

  2. Policy triggers policy evaluation (calling PIPs / identity stores)

  3. One step becomes slow (AD, REST, DNS, disk, cache, GC)

  4. The request remains “alive” until NAD timers expire

  5. The NAD retransmits → retry storm

  6. Load multiplies without a hard failure signal
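The multiplication in steps 4 to 6 can be sketched with a small model. This is a simplification under stated assumptions: the NAD uses a fixed (non-backoff) timer, resends only when that timer expires, and the function name and parameters are illustrative rather than taken from any NAD or ISE configuration. It counts packets sent by the NAD and ignores any duplicate detection on the server side.

```python
def requests_sent_by_nad(service_latency_s: float,
                         nad_timeout_s: float,
                         max_retransmits: int) -> int:
    """Count Access-Requests a NAD sends for ONE authentication attempt.

    Assumption: the NAD resends every time its timer expires without a
    reply, up to max_retransmits extra packets, with a fixed timer.
    """
    if nad_timeout_s <= 0:
        raise ValueError("nad_timeout_s must be positive")
    sent = 1                    # the original Access-Request
    elapsed = nad_timeout_s     # time at which the first timer expires
    while elapsed < service_latency_s and sent <= max_retransmits:
        sent += 1               # timer expired before the reply arrived
        elapsed += nad_timeout_s
    return sent


if __name__ == "__main__":
    # Same slow step, two different NAD timers (values are illustrative).
    for timeout in (1.0, 5.0):
        n = requests_sent_by_nad(3.961, timeout, max_retransmits=3)
        print(f"timeout={timeout}s -> {n} request(s) per authentication")
```

In this model the multiplier depends only on the ratio between step latency and the NAD timer, which is why a slow step can multiply offered load without any explicit error on the PSN.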


2. Observable Production Symptoms (Real Cases)

2.1 High Authentication Latency Without AD in the Path

A real case shows:

  • Alarm: High Authentication Latency

  • Log entry: Evaluating Policy Group (Step latency=3961ms)

  • No AD involved at that stage

  • Multiple internal PIP lookups (Called-Station-ID, Location, etc.)

This is a clear signature of internal contention or indirect dependencies (DB, cache, DNS, internal lookups).

What breaks:

  • Authentication does not fail consistently

  • Throughput collapses due to growing queues

  • During peaks, NADs treat ISE as dead


2.2 ISE “Under Capacity” but >10s Auth Latency Due to DNS / Logging

Case (ISE 3.1 P1):

  • Authentication latency > 10 seconds during peak

  • PSN not at session limit

  • WLCs mark ISE as dead (RADIUS timeout)

  • Failover to secondary PSN

Root cause:

  • Remote Logging Target configured via FQDN

  • DNS resolution impacting internal queues

  • Switching the logging target from FQDN to an IPv4 address → issue resolved immediately

Architectural lesson: latency can originate from non-obvious components (DNS / logging) and surface as AAA pipeline queueing.
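One way to audit for this class of problem is to list every configured target that is a hostname rather than an IP literal. The sketch below assumes the target list has already been exported into a plain dictionary; the target names and hostnames are hypothetical, and nothing here calls an ISE API.

```python
import ipaddress


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4/IPv6 literal, False if it needs DNS."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def targets_needing_dns(targets: dict[str, str]) -> list[str]:
    """Names of targets whose address must be resolved through DNS."""
    return sorted(name for name, host in targets.items()
                  if not is_ip_literal(host))


if __name__ == "__main__":
    # Hypothetical export of remote logging targets: name -> configured host.
    exported = {
        "siem-primary": "siem01.example.internal",
        "siem-backup": "192.0.2.10",
    }
    print(targets_needing_dns(exported))
```

Targets flagged this way are not necessarily a problem, but each one is a synchronous DNS dependency worth checking during a latency investigation.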


2.3 External Latency (AD / DC) → Extreme Step Latency

Another report shows:

  • Step latency > 60,000 ms

  • RPC communication errors with Domain Controllers

  • Failover threshold exceeded

Typical pattern:

  • Unstable external dependency

  • DC switching / RPC failures

  • AAA SLA collapse

  • Policy fallback or rejection behavior


3. How This Becomes an “Auth Storm”

3.1 Amplifiers (Exponential Effects)

  • Aggressive RADIUS timeout on NADs (e.g. 1s)

  • Fast retransmissions

  • Boot storms / shift changes / power recovery

  • CoA or mass reauthentication

  • One slow PSN pushes its load onto the remaining PSNs → the slowdown spreads across the cluster

A recurring point in the community:

RADIUS timeout must cover RTT NAD↔ISE + RTT ISE↔Identity Store + policy evaluation + extra checks. Aggressive timeouts cause failures even without explicit ISE errors.
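The rule above is simple addition, which can be written down as a budget check. This is a minimal sketch: the class name, field names and the 20% headroom factor are assumptions for illustration, and the numbers should come from your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthLatencyBudget:
    """Worst-case time, in ms, for the PSN to answer one RADIUS request."""
    rtt_nad_psn_ms: float         # NAD <-> PSN round trip
    rtt_psn_store_ms: float       # PSN <-> identity store round trip
    policy_eval_ms: float         # policy evaluation time (Step Latency)
    extra_checks_ms: float = 0.0  # e.g. OCSP / CRL / MFA, if in the path

    def total_ms(self) -> float:
        return (self.rtt_nad_psn_ms + self.rtt_psn_store_ms
                + self.policy_eval_ms + self.extra_checks_ms)


def timeout_covers(budget: AuthLatencyBudget,
                   nad_timeout_ms: float,
                   headroom: float = 1.2) -> bool:
    """True if the NAD timeout covers the budget plus headroom (assumed 20%)."""
    return nad_timeout_ms >= budget.total_ms() * headroom


if __name__ == "__main__":
    # Illustrative values; policy_eval_ms reuses the 3961 ms step from 2.1.
    budget = AuthLatencyBudget(rtt_nad_psn_ms=20, rtt_psn_store_ms=5,
                               policy_eval_ms=3961)
    for timeout_ms in (1000, 5000):
        print(timeout_ms, timeout_covers(budget, timeout_ms))
```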


4. Practical Diagnosis (TAC-Style)

4.1 Step Latency as the Key Divider

  • Slow PIP / identity store step → external latency

  • Slow policy evaluation step, no external calls → internal contention / queueing

BRKSEC-3412 demonstrates this method explicitly.


4.2 Correlate With “What Changed” Outside AAA

Classic examples:

  • DNS changes

  • Syslog targets

  • Certificate validation

  • CRL / OCSP

  • REST integrations

Not everything that breaks authentication lives in AD or RADIUS.


5. Architectural Mitigations (Beyond Tuning)

5.1 Principle 1 — Reduce Synchronous Dependencies

  • Avoid policies that query multiple external systems on every auth

  • Treat enrichment as post-auth whenever possible


5.2 Principle 2 — Size for Real Latency, Not Lab Numbers

Cisco scale tests assume:

  • Inter-node latency < 5 ms

If your environment exceeds this:

  • Published scale numbers do not directly apply

Official performance guidance warns:

  • ISE ↔ AD / LDAP latency should be ~5 ms or less

  • Small increases can drastically reduce TPS
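Why a few milliseconds matter can be illustrated with a back-of-the-envelope model based on Little's law: if each authentication holds a worker for its whole duration, sustainable throughput is roughly the number of workers divided by the time each one stays busy. This is not Cisco's sizing model; the worker count and timings below are invented for illustration.

```python
def estimated_max_tps(workers: int,
                      local_ms: float,
                      store_rtt_ms: float,
                      lookups_per_auth: int = 1) -> float:
    """Rough ceiling on auths/second when every lookup blocks a worker.

    Assumption: each authentication occupies one worker for
    local_ms + lookups_per_auth * store_rtt_ms, with no other limits.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    busy_s = (local_ms + lookups_per_auth * store_rtt_ms) / 1000.0
    if busy_s <= 0:
        raise ValueError("total busy time must be positive")
    return workers / busy_s


if __name__ == "__main__":
    # Hypothetical PSN: 100 workers, 5 ms of local processing per auth.
    for rtt_ms in (5, 50):
        tps = estimated_max_tps(workers=100, local_ms=5, store_rtt_ms=rtt_ms)
        print(f"identity store RTT {rtt_ms} ms -> ~{tps:.0f} auth/s ceiling")
```

In this model, moving the identity store from 5 ms to 50 ms away makes each authentication hold its worker 5.5 times longer, so the throughput ceiling drops by the same factor.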


5.3 Principle 3 — Storm Control (Logical, Not Network)

  • Set NAD timeouts to match real worst-case latency

  • Use backoff / deadtime carefully

  • Avoid synchronized mass reauthentication


5.4 Principle 4 — Session Persistence (Load Balancers)

If a RADIUS load balancer is used:

  • Session stickiness is mandatory

  • Auth / reauth / accounting must hit the same PSN

  • Required for:

      • Caching

      • EAP Session Resume

      • Fast Reconnect

Without it → intermittent failures and “ghost issues”.
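The idea behind persistence can be sketched as a deterministic mapping from an endpoint key to a PSN. The sketch assumes the persistence key is the endpoint MAC address as carried in Calling-Station-ID, which is a common choice but an assumption here; the PSN names are hypothetical. Real load balancers usually keep a persistence table instead of hashing, because a plain modulo reshuffles endpoints whenever the pool changes.

```python
import hashlib


def normalize_mac(calling_station_id: str) -> str:
    """Strip common MAC separators so all formats map to one key."""
    for sep in ("-", ":", "."):
        calling_station_id = calling_station_id.replace(sep, "")
    return calling_station_id.lower()


def pick_psn(calling_station_id: str, psns: list[str]) -> str:
    """Map an endpoint to the same PSN on every request (static pool)."""
    if not psns:
        raise ValueError("PSN pool is empty")
    key = normalize_mac(calling_station_id).encode("ascii")
    digest = hashlib.sha256(key).digest()
    return psns[int.from_bytes(digest[:8], "big") % len(psns)]


if __name__ == "__main__":
    pool = ["psn-a", "psn-b", "psn-c"]  # hypothetical PSN names
    # Auth and accounting for one endpoint, sent with different MAC formats.
    print(pick_psn("AA-BB-CC-DD-EE-FF", pool))
    print(pick_psn("aabb.ccdd.eeff", pool))
```

Normalizing the key first matters because different NADs can format the same MAC address differently, which would otherwise split one endpoint across PSNs.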


Failure Mode Checklist

  • Is there any external dependency in the critical auth path?

      • AD / MFA / OCSP / REST

  • Is DNS or logging via FQDN blocking processing?

  • Does the NAD timeout cover RTT + real policy evaluation time?

  • Is there retry storm potential (boot storms / WLC behavior)?

  • Does the load balancer enforce RADIUS session persistence?

