
Circuit Breaker Edge Cases FAQ & Answers

Ten expert answers on circuit breaker edge cases, researched from official documentation.


Q: What happens when the circuit transitions from Open to Half-Open and queued requests flood the recovering service?

A: When the circuit transitions from Open to Half-Open, all queued requests rush to the recovering service simultaneously and overwhelm it again: the circuit opens, 1,000 requests queue up, the circuit goes Half-Open, all 1,000 requests hit the service at once, the service crashes, and the circuit opens again. Infinite loop. Solution: limit concurrent requests in the Half-Open state to a small number (1-5). Pattern: if (state === 'HALF_OPEN' && activeRequests >= 3) return reject('Circuit testing'). Test recovery with minimal load only, and let successful responses gradually increase the allowed concurrency.
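
A minimal sketch of that gate, assuming a hand-rolled breaker object; the state field, the counter, and the concurrency cap of 3 are illustrative, not any particular library's API:

```js
// Sketch: cap concurrent trial requests while the circuit is HALF_OPEN.
// `breaker` is a hypothetical hand-rolled object, not a specific library's API.
const breaker = {
  state: 'HALF_OPEN',        // CLOSED | OPEN | HALF_OPEN
  activeRequests: 0,
  maxHalfOpenConcurrency: 3, // allow at most 3 trial requests at once
};

async function callThroughBreaker(fn) {
  if (breaker.state === 'OPEN') {
    throw new Error('Circuit open');
  }
  if (breaker.state === 'HALF_OPEN' &&
      breaker.activeRequests >= breaker.maxHalfOpenConcurrency) {
    throw new Error('Circuit testing, request rejected'); // shed the excess load
  }
  breaker.activeRequests += 1;
  try {
    const result = await fn(); // trial request to the recovering service
    // A success here could close the circuit or raise the concurrency cap.
    return result;
  } finally {
    breaker.activeRequests -= 1;
  }
}
```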

Q: How does one service's open circuit cascade into its dependents, and how do you contain it?

A: Service A's circuit opens, Service B times out waiting for A and its circuit opens, Service C times out waiting for B and its circuit opens: a cascade. Example: the payment service circuit opens → the order service times out waiting for payment → the order circuit opens → the frontend times out → the user sees errors. Solutions: (1) use short timeouts so each caller fails fast instead of hanging, (2) implement fallbacks at each level, (3) use the bulkhead pattern to isolate failure domains. Pattern: with a 2-second timeout and a threshold of 5 failures, each call fails within 2s and the circuit opens after roughly 10s of failed calls (5 × 2s), as sketched below.
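
A sketch of the fast-fail-plus-fallback idea at one level of the chain; fetchPayment, the internal URL, and the 2-second budget are hypothetical stand-ins, not a specific framework's API:

```js
// Sketch: fail fast on a slow dependency and degrade instead of propagating the stall.
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)),
  ]);
}

// Hypothetical downstream client; stands in for a real payment-service call.
const fetchPayment = (orderId) =>
  fetch(`https://payments.internal/api/payments/${orderId}`).then((r) => r.json());

async function getOrderStatus(orderId) {
  try {
    // Fast-fail: give the payment service 2s, well under the caller's own timeout.
    const payment = await withTimeout(fetchPayment(orderId), 2000);
    return { orderId, payment };
  } catch (err) {
    // Fallback at this level so the failure does not cascade upward.
    return { orderId, payment: { status: 'unknown', degraded: true } };
  }
}
```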

Q: Why does the circuit stay open for the full resetTimeout even after the downstream service recovers?

A: The circuit stays open for the full resetTimeout duration even if the downstream service recovers earlier, because the breaker only transitions to Half-Open to test recovery after resetTimeout expires. If resetTimeout is 5 minutes but the service recovers in 30 seconds, the circuit remains unnecessarily open for 4.5 extra minutes, blocking all requests. This is by design: the circuit breaker cannot know the service has recovered without testing, and testing only happens after resetTimeout expires. Common mistake: setting resetTimeout too long (10-15 minutes) based on worst-case recovery time, causing extended outages even for quick recoveries.

Q: How should resetTimeout be configured to balance quick recovery against hammering a still-failing service?

A: Use a short resetTimeout (30-60 seconds) with exponential backoff on repeated failures to balance quick recovery against a thundering herd. Start with a 30-second resetTimeout: if the Half-Open test succeeds, the circuit closes immediately; if it fails, double the resetTimeout for the next attempt (30s → 60s → 120s → 240s), capping at a 5-minute maximum. Configuration pattern: { resetTimeout: 30000, backoffMultiplier: 2, maxResetTimeout: 300000 }. How it works: circuit opens → wait 30s → test in Half-Open (1-3 requests) → on success close the circuit, on failure wait 60s → test again → on failure wait 120s → repeat until success or the cap is reached. Rationale: a quick initial attempt (30s) catches fast recoveries, exponential backoff prevents hammering a persistently failing service, and the cap prevents indefinitely long waits. Avoid fixed long timeouts (5-10 minutes), which delay recovery for quick failures, and very short timeouts (<10s), which hammer failing services and waste resources.
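
A small sketch of that backoff schedule, assuming a hand-rolled breaker that recomputes its wait each time it opens; the field and function names are illustrative:

```js
// Sketch: exponential backoff on resetTimeout, capped at a maximum.
const config = { resetTimeout: 30_000, backoffMultiplier: 2, maxResetTimeout: 300_000 };

let consecutiveOpens = 0; // how many times the circuit has opened without a successful close

function nextResetTimeout() {
  const delay = config.resetTimeout * config.backoffMultiplier ** consecutiveOpens;
  return Math.min(delay, config.maxResetTimeout); // 30s, 60s, 120s, 240s, then capped at 300s
}

function onCircuitOpen(scheduleHalfOpen) {
  const wait = nextResetTimeout();
  consecutiveOpens += 1;
  setTimeout(scheduleHalfOpen, wait); // transition to HALF_OPEN after the backed-off wait
}

function onCircuitClose() {
  consecutiveOpens = 0; // a successful recovery resets the backoff
}
```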

Q: How do you keep temporary network blips or a single slow request from tripping the circuit?

A: Temporary network blips or a single slow request shouldn't trip the circuit. Use a sliding window with a minimum request threshold. Pattern: track the last 20 requests in a sliding window and open the circuit only if >50% failed AND at least 10 requests were recorded. Config: { threshold: 0.5, minimumRequests: 10, windowSize: 20 }. This filters out one-off network errors, slow requests during load spikes, and other transient failures. Prefer a time-based window as well: count failures in the last 60 seconds, not just the last N requests. Avoid consecutive-failure counting (e.g., 5 failures in a row), which is too sensitive; a failure percentage over a time window is more stable.
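
A rough sketch of the time-based window described above; the record/shouldOpen helpers are illustrative rather than a specific library's API:

```js
// Sketch: failure-rate check over a 60-second time window with a minimum sample size.
const config = { threshold: 0.5, minimumRequests: 10, windowDuration: 60_000 };
const samples = []; // { timestamp, ok } entries for recent requests

function record(ok) {
  samples.push({ timestamp: Date.now(), ok });
}

function shouldOpen() {
  const cutoff = Date.now() - config.windowDuration;
  // Drop samples that have aged out of the window.
  while (samples.length && samples[0].timestamp < cutoff) samples.shift();

  if (samples.length < config.minimumRequests) return false; // too few samples to judge
  const failures = samples.filter((s) => !s.ok).length;
  return failures / samples.length > config.threshold; // open only above the failure rate
}
```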

Q: What does the failure threshold mean, and should it be an absolute count or a percentage?

A: The threshold is the failure level that triggers the circuit to open, and the common confusion is absolute count vs percentage. Absolute: failureThreshold: 5 means open after 5 failures (simple but inflexible). Percentage: threshold: 0.5 with minimumRequests: 10 means open if >50% of requests fail in the window AND at least 10 requests were made. Percentage-based thresholds handle variable load better. Config example: { threshold: 0.5, minimumRequests: 20, windowDuration: 60000 } opens the circuit if >50% of requests fail, with at least 20 requests in a 60-second window. Low-traffic services need a lower minimumRequests (5-10); high-traffic services need a higher one (50-100).
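
A short sketch contrasting the two policies on the same window of traffic; the stats numbers are made up for illustration:

```js
// Sketch: the same window of traffic judged by an absolute count vs a percentage.
const stats = { total: 40, failures: 6 }; // e.g., 40 requests in the current window, 6 failed

// Absolute policy: opens even though only 15% of requests failed.
const absolute = { failureThreshold: 5 };
const opensAbsolute = stats.failures >= absolute.failureThreshold; // true

// Percentage policy: needs both a failure rate and a minimum sample size.
const percentage = { threshold: 0.5, minimumRequests: 20 };
const opensPercentage =
  stats.total >= percentage.minimumRequests &&
  stats.failures / stats.total > percentage.threshold; // false (15% < 50%)

console.log({ opensAbsolute, opensPercentage });
```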

Q: Should a circuit breaker always have a fallback, and what should it do?

A: Yes, always provide a fallback for graceful degradation. The fallback executes when the circuit is open or a request fails. Patterns: (1) return cached data: fallback: () => cache.get(key), (2) return a default value: fallback: () => ({ status: 'degraded', data: [] }), (3) call a backup service: fallback: () => backupAPI.call(), (4) return an error with context: fallback: (err) => ({ error: 'Service unavailable', retry: true }). A fallback should be fast (<100ms), reliable (no external calls unless it targets a backup), and informative (tell the user the service is degraded). Don't call the same failing service, make slow database queries, or throw errors from the fallback.
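
A sketch of the cached-data fallback variant; cache, fetchProfile, and the internal URL are hypothetical stand-ins:

```js
// Sketch: cheap, local fallback used when the breaker-protected call fails or is rejected.
const cache = new Map(); // hypothetical in-process cache of last-known-good responses

// Hypothetical remote call; in practice it would be wrapped by the circuit breaker.
const fetchProfile = (userId) =>
  fetch(`https://profiles.internal/users/${userId}`).then((r) => r.json());

async function getProfile(userId) {
  try {
    const profile = await fetchProfile(userId);
    cache.set(userId, profile); // refresh the fallback data on every success
    return { ...profile, degraded: false };
  } catch (err) {
    const cached = cache.get(userId);
    if (cached) return { ...cached, degraded: true }; // fast: no external calls
    return { status: 'degraded', data: [], degraded: true }; // informative default
  }
}
```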

Q: How do you monitor a circuit breaker and alert on state changes?

A: Emit events on state changes and track metrics. Pattern: breaker.on('open', () => { logger.error('Circuit opened', { service: 'payment' }); metrics.increment('circuit.open'); alert('Payment circuit open'); }). Key events: 'open' (circuit opened), 'halfOpen' (testing recovery), 'close' (recovered), and 'success'/'failure' (per request). Track metrics: time spent in the open state, opens per hour, success rate in Half-Open, and fallback invocation count. Alert on: circuit open for >5 minutes, >3 opens in 1 hour, or Half-Open test failures. Visualize in a Grafana dashboard with a circuit-state timeline.
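
Extending that pattern into a fuller sketch, assuming a breaker that emits the events listed above and generic logger/metrics clients (metrics.timing is a statsd-style call, an assumption):

```js
// Sketch: wire every state-change event into logs and metrics.
// `breaker`, `logger`, and `metrics` are assumed to exist, as in the pattern above.
let openedAt = null;

breaker.on('open', () => {
  openedAt = Date.now();
  logger.error('Circuit opened', { service: 'payment' });
  metrics.increment('circuit.open');
});

breaker.on('halfOpen', () => {
  logger.warn('Circuit half-open, testing recovery', { service: 'payment' });
});

breaker.on('close', () => {
  if (openedAt) metrics.timing('circuit.open_duration_ms', Date.now() - openedAt);
  openedAt = null;
  logger.info('Circuit closed, service recovered', { service: 'payment' });
});

breaker.on('failure', () => metrics.increment('circuit.request_failure'));
breaker.on('success', () => metrics.increment('circuit.request_success'));
```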

Q: What goes wrong when resetTimeout is set too long?

A: A resetTimeout that is too long keeps the circuit open long after the service recovers. The circuit opens, stays open for the full resetTimeout, and only then transitions to Half-Open to test. If resetTimeout is 5 minutes but the service recovered in 30 seconds, the circuit rejects valid requests for 4.5 extra minutes. Solution: use a shorter resetTimeout (30-60 seconds) with exponential backoff on repeated failures, e.g., 30s after the first open, 60s after the second, 120s after the third. This balances quick recovery testing against a thundering herd on a still-failing service.

Q: How do you detect a circuit that stays open longer than expected?

A: Track how long the circuit stays in each state and alert on anomalies, comparing time spent open against the expected recovery time. Pattern: breaker.on('open', () => { startTime = Date.now(); logger.warn('Circuit opened'); }); breaker.on('halfOpen', () => { duration = Date.now() - startTime; if (duration > expectedRecovery * 2) alert('Circuit stuck open'); }). Alert if the circuit stays open for more than 2× the expected recovery time (e.g., the service typically recovers in 60s, so alert if it is open >120s). Use a metrics dashboard to track a histogram of circuit open durations and identify stuck circuits whose resetTimeout needs tuning.
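
A sketch of a watchdog variant that also catches a circuit that never reaches Half-Open; expectedRecoveryMs and alertOps are hypothetical names for illustration:

```js
// Sketch: alert when a circuit has been open for more than 2x the expected recovery time,
// even if it has not yet transitioned to Half-Open.
const expectedRecoveryMs = 60_000; // what this service typically needs; illustrative figure
let watchdog = null;

breaker.on('open', () => {
  const openedAt = Date.now();
  watchdog = setInterval(() => {
    const openFor = Date.now() - openedAt;
    if (openFor > expectedRecoveryMs * 2) {
      alertOps('Circuit stuck open', { openForMs: openFor }); // hypothetical alerting hook
      clearInterval(watchdog); // alert once per stuck-open incident
    }
  }, 30_000); // check every 30s while the circuit is open
});

breaker.on('close', () => {
  clearInterval(watchdog);
  watchdog = null;
});
```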
