
Production Edge Cases Advanced FAQ & Answers

10 expert answers on advanced production edge cases, researched from official documentation and other authoritative sources.

Q: What happens when a Node.js process receives SIGTERM with no handler, and how should graceful shutdown be implemented?

A:

When Node.js receives SIGTERM with no handler, it exits immediately - pending promises never run because the event loop stops. Solution: Register process.on('SIGTERM', gracefulShutdown), not process.once, so cleanup can run. In shutdown: (1) Stop accepting new requests (server.close()), (2) Wait for in-flight requests with a timeout (Promise.race), (3) Close DB connections (await db.close()), (4) DON'T call process.exit() - let Node exit naturally when the event loop is empty. Pattern: async function gracefulShutdown(signal) { server.close(); await Promise.race([waitForInflight(), timeout(30000)]); await db.close(); /* Node exits naturally */ }
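
A minimal sketch of this shutdown sequence, assuming a plain http server; the in-flight counter, waitForInflight helper, and 30s budget are illustrative, not from a specific framework:

    const http = require('http');

    // Track in-flight requests so shutdown can wait for them (illustrative approach).
    let inflight = 0;
    const server = http.createServer((req, res) => {
      inflight++;
      res.on('finish', () => inflight--);
      res.end('ok');
    });
    server.listen(8080);

    const timeout = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    // Poll until every tracked request has finished.
    async function waitForInflight() {
      while (inflight > 0) await timeout(100);
    }

    async function gracefulShutdown(signal) {
      server.close();                                           // (1) stop accepting new requests
      await Promise.race([waitForInflight(), timeout(30000)]);  // (2) bounded drain
      // await db.close();                                      // (3) close DB handles (app-specific)
      // (4) no process.exit(): Node exits once the event loop is empty
    }

    process.on('SIGTERM', gracefulShutdown);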

Q: What happens when a PostgreSQL connection pool is exhausted in a Node.js service, and how do you recover?

A:

When all connections are exhausted, new requests BLOCK waiting for a free connection - indefinitely unless connectionTimeoutMillis is set - then fail with a timeout error. No automatic recovery. Solutions: (1) Set connectionTimeoutMillis: 5000 to fail fast, (2) Implement a circuit breaker that checks pool saturation (pool.totalCount - pool.idleCount) before queries, (3) Use request queuing with PQueue (concurrency < pool size), (4) Graceful degradation: serve from cache when the circuit opens. Advanced: Monitor saturation (in-use / totalCount); if >80%, open the circuit and fall back to cached data. Prevention: Use PgBouncer transaction pooling, limit app connections to 70% of max_connections, implement backpressure.
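
A sketch of the fail-fast timeout plus the saturation check, assuming node-postgres (pg); the 80% threshold follows the answer, and getFromCache is a hypothetical fallback:

    const { Pool } = require('pg');

    const pool = new Pool({
      max: 20,                        // stay well under Postgres max_connections
      connectionTimeoutMillis: 5000,  // fail fast instead of blocking forever
      idleTimeoutMillis: 30000,
    });

    // Fraction of the pool currently checked out.
    function saturation() {
      if (pool.totalCount === 0) return 0;
      return (pool.totalCount - pool.idleCount) / pool.totalCount;
    }

    async function queryWithBackpressure(sql, params) {
      if (saturation() > 0.8) {
        // Circuit open: degrade to cache instead of piling onto the pool.
        return getFromCache(sql, params); // hypothetical cache lookup
      }
      return pool.query(sql, params);
    }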

Q: How do you prevent concurrent requests to a token refresh endpoint from logging users out?

A:

Multiple requests racing to /refresh can cause user logout when a stale token overwrites client storage. Solutions: (1) Cookie Expiry Delta: Subtract 2 min from the cookie expiry so the browser stops sending the token before the server considers it expired (simplest), (2) Client-Side Locking: Use a shared promise - if a refresh is in progress, wait for it instead of starting a new one (most reliable), (3) Server-Side Redis Cache: Cache the refresh result with a 1s TTL so racing requests get the same tokens, (4) Refresh Token Rotation with Reuse Detection: Invalidate the entire token family if a previously-used token is sent (security). Best practice: Combine client locking + server cache + reuse detection. Code: this.refreshPromise ||= this.refreshToken().finally(() => this.refreshPromise = null) - expanded below.
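
A sketch expanding the shared-promise lock from (2); the /refresh endpoint and response shape are assumptions:

    class TokenClient {
      constructor() {
        this.refreshPromise = null;
        this.accessToken = null;
      }

      // Every concurrent caller awaits the same in-flight refresh.
      refresh() {
        this.refreshPromise ||= this.doRefresh()
          .finally(() => { this.refreshPromise = null; });
        return this.refreshPromise;
      }

      async doRefresh() {
        const res = await fetch('/refresh', { method: 'POST', credentials: 'include' });
        if (!res.ok) throw new Error(`refresh failed: ${res.status}`);
        const { accessToken } = await res.json(); // assumed response shape
        this.accessToken = accessToken;           // single writer: no stale overwrite
        return accessToken;
      }
    }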

Q: How do you prevent and recover from a Docker container running out of disk space?

A:

Container crashes with 'no space left on device', often entering a zombie state. Prevention: (1) Use multi-stage builds to minimize image size, (2) Set container storage limits: docker run --storage-opt size=10G, (3) Configure log rotation inside the container, (4) Monitor disk space in app code (check-disk-space npm package) and trigger cleanup at 90%. Recovery: docker exec <container> df -h to check usage, docker system prune -a --volumes -f on the host (removes ALL unused images, containers, and volumes), find /tmp -type f -atime +7 -delete inside the container. Last resort: docker commit <container> backup:latest, restart with a bigger volume. Monitoring: check disk space every 5 min (sketch below), alert at 90%, auto-clean temp files. Prevention > Recovery.
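
A sketch of the in-app monitor from (4), using the check-disk-space package named above; the threshold, interval, and cleanup hook are illustrative:

    const checkDiskSpace = require('check-disk-space').default;

    const FIVE_MINUTES = 5 * 60 * 1000;

    setInterval(async () => {
      const { free, size } = await checkDiskSpace('/');
      const used = (size - free) / size;
      if (used > 0.9) {
        console.error(`Disk ${Math.round(used * 100)}% full - cleaning up`);
        await cleanupTempFiles(); // hypothetical: e.g. remove /tmp files older than 7 days
      }
    }, FIVE_MINUTES);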

Q: Why do WebSocket connections behind nginx drop after 60 seconds of idle, and how do you fix it?

A:

WebSocket connections close after exactly 60s idle due to the proxy_read_timeout default. Nginx closes the connection if the upstream doesn't transmit within the timeout. Solution: (1) Set proxy_read_timeout 7d; for WebSocket locations, (2) Set proxy_http_version 1.1; and the upgrade headers: proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection $connection_upgrade;, (3) Disable buffering: proxy_buffering off; proxy_cache off;, (4) Application-level keep-alive: send WebSocket ping frames every 30s to reset the timeout. Pattern: setInterval(() => ws.ping(), 30000). Debugging: error_log /var/log/nginx/error.log debug; and log upstream_response_time. Map $http_upgrade correctly: map $http_upgrade $connection_upgrade { default upgrade; '' close; }. Critical: Both the nginx config and the app-level ping are needed for reliability (full config below).
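
Assembling the directives above into one configuration (upstream name and path are placeholders; the map block belongs in the http context):

    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        listen 80;

        location /ws/ {
            proxy_pass http://app_backend;   # placeholder upstream
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_read_timeout 7d;           # stop nginx killing idle WebSockets
            proxy_buffering off;
            proxy_cache off;
        }
    }

Pair this with the 30s ws.ping() keep-alive on the application side.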

Q: Why is a JWT rejected with 'Token is not valid yet', and how should clock skew be handled?

A:

JWT rejected with 'Token is not valid yet' when clocks out of sync between issuer and validator. Even +1 second skew causes failures. Industry standards: Default clock tolerance is 5 minutes (300s), recommended minimum 1 minute (60s). Solutions: (1) Set clockTolerance in JWT verification: jwt.verify(token, secret, {clockTolerance: 60}), (2) Use NTP sync on all servers: apt-get install ntp, verify with timedatectl status, (3) Monitor clock drift: Fetch time from worldtimeapi.org, alert if skew >60s. Configs: Same DC: 60s, Cross-region: 120s, Mobile clients: 300s (less reliable clocks), High-security with NTP: 30s. Prevention: Implement NTP, monitor drift actively, design for inevitable clock skew in distributed systems.
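
A sketch of tolerant verification with the jsonwebtoken package (clockTolerance is in seconds; pick the value from the configs above):

    const jwt = require('jsonwebtoken');

    const CLOCK_TOLERANCE_SECONDS = 60; // same-DC value from the configs above

    function verifyToken(token, secret) {
      try {
        return jwt.verify(token, secret, { clockTolerance: CLOCK_TOLERANCE_SECONDS });
      } catch (err) {
        if (err.name === 'NotBeforeError') {
          // Skew exceeded the tolerance: alert on clock drift, don't just retry.
          console.error('JWT not valid yet - check NTP sync across servers');
        }
        throw err;
      }
    }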

Q: How do you handle a database migration that fails partway through a blue-green deployment?

A:

Migration fails at 60%, leaving a partial schema; both blue and green environments break because they share the same DB. Solutions: (1) BACKWARD COMPATIBLE migrations only - 4-phase approach: add column (nullable) → dual-write period → make NOT NULL → remove old column (separate deployments), (2) Idempotent migrations: check IF NOT EXISTS before ALTER, use transactions where possible for automatic rollback, (3) Logical replication for rollback: set up Green→Blue replication after switchover for a 1-hour rollback window, (4) Pre/Post split: run additive changes BEFORE deployment, destructive changes AFTER it is confirmed. Pattern: DO $$ BEGIN IF NOT EXISTS (...) THEN ALTER TABLE ... (completed below). Recovery: check the schema_migrations table, then manually complete the migration or keep both columns (fix in the next deployment). Never force-remove during an incident. Key: All migrations must be backward compatible.
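
Completing the quoted idempotent pattern for PostgreSQL (table and column names are placeholders):

    DO $$
    BEGIN
      IF NOT EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name = 'users' AND column_name = 'email_verified'
      ) THEN
        -- Phase 1 of the 4-phase approach: additive and nullable, so both
        -- blue and green keep working against the shared schema.
        ALTER TABLE users ADD COLUMN email_verified boolean;
      END IF;
    END $$;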

Q: How do you reduce OpenTelemetry tracing overhead in production?

A:

OpenTelemetry can cause 30-80% performance degradation with full tracing. Java: 30% overhead even when disabled. Node.js: 80% reduction in req/sec with HTTP instrumentation. Solutions: (1) Aggressive sampling: TraceIdRatioBasedSampler(0.01) = 1% of traces, plus conditional sampling (always sample errors), (2) Disable unnecessary instrumentations: -Dotel.instrumentation.fs.enabled=false, (3) Batch + async export: use BatchSpanProcessor (maxQueueSize: 2048, scheduledDelayMillis: 5000), not SimpleSpanProcessor, which exports each span synchronously and blocks the event loop, (4) Collector-level sampling: probabilistic_sampler at 1%. Benchmarks: 100% tracing = 80% degradation; 1% sampling + batching = 5% degradation. Target: <5% overhead. Use feature flags for gradual rollout.
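
A sketch of the sampling + batching setup for Node.js (standard @opentelemetry packages; the collector URL is a placeholder, and option names can vary slightly across SDK versions):

    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const {
      ParentBasedSampler,
      TraceIdRatioBasedSampler,
      BatchSpanProcessor,
    } = require('@opentelemetry/sdk-trace-base');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

    const sdk = new NodeSDK({
      // 1% head sampling; children inherit the parent's decision.
      sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.01) }),
      // Batch spans off the hot path instead of exporting synchronously.
      spanProcessors: [
        new BatchSpanProcessor(
          new OTLPTraceExporter({ url: 'http://collector:4318/v1/traces' }),
          { maxQueueSize: 2048, scheduledDelayMillis: 5000 },
        ),
      ],
    });

    sdk.start();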

Q: What circuit breaker edge cases matter in production, and how do you handle them?

A:

Circuit breaker has 3 states: Closed (normal), Open (failing, blocking calls), Half-Open (testing recovery). Critical edge cases: (1) Thundering herd on recovery: When circuit transitions Half-Open→Closed, all queued requests rush in. Solution: Limit concurrent requests in Half-Open (1-3 requests), add exponential backoff with full jitter to stagger retries, (2) Cascading circuit opens: Service A circuit opens → Service B timeout → B's circuit opens. Solution: Tune timeouts hierarchically (upstream shorter than downstream), implement bulkhead isolation, (3) Circuit stuck open: Service recovered but circuit stays open. Solution: Aggressive Half-Open attempts (30s interval), adaptive thresholds using sliding window (not consecutive failures), (4) Idempotence violations: Retries on non-idempotent operations cause data corruption. Solution: Only retry safe operations (GET, idempotent POST). 2025 best practices: Use Resilience4j or Istio service mesh (Hystrix deprecated), implement SRE metrics (MTTD/MTTR), adaptive ML-based threshold tuning. Config: threshold: 5 failures, timeout: 30s, resetTimeout: 60s, maxRetries: 3, jitter: 0-1000ms.
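
A hand-rolled sketch of the three states with a capped number of Half-Open probes, using the config values from the answer (production services would normally reach for Resilience4j, a service mesh, or a library like opossum instead):

    class CircuitBreaker {
      constructor(fn, { threshold = 5, resetTimeout = 60000, halfOpenMax = 3 } = {}) {
        Object.assign(this, { fn, threshold, resetTimeout, halfOpenMax });
        this.state = 'CLOSED';
        this.failures = 0;
        this.halfOpenInFlight = 0;
      }

      async call(...args) {
        if (this.state === 'OPEN') {
          if (Date.now() - this.openedAt < this.resetTimeout) {
            throw new Error('circuit open'); // fail fast, protect the downstream
          }
          this.state = 'HALF_OPEN'; // resetTimeout elapsed: probe recovery
        }
        const probing = this.state === 'HALF_OPEN';
        if (probing && this.halfOpenInFlight >= this.halfOpenMax) {
          throw new Error('half-open probe limit reached'); // caps the thundering herd
        }
        if (probing) this.halfOpenInFlight++;
        try {
          const result = await this.fn(...args);
          this.state = 'CLOSED'; // a successful probe closes the circuit
          this.failures = 0;
          return result;
        } catch (err) {
          this.failures++;
          if (probing || this.failures >= this.threshold) {
            this.state = 'OPEN';
            this.openedAt = Date.now();
          }
          throw err;
        } finally {
          if (probing) this.halfOpenInFlight--;
        }
      }
    }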

Q: How do you use sourcemaps to debug minified production JavaScript without exposing source code?

A:

Production builds minified + mangled, sourcemaps essential for debugging. 2025 best practice: Use hidden-source-map to generate maps without exposing them publicly. Configuration: (1) TypeScript: Set sourceMap: true in tsconfig.json, (2) Webpack: devtool: 'hidden-source-map' (not 'source-map' or 'eval'), (3) Vite: build.sourcemap: 'hidden' in vite.config.js. Security: Never expose .map files to end users - upload privately to error tracking services only. CI/CD upload: Add build step using provider CLI: Sentry CLI (sentry-cli sourcemaps upload), Datadog CLI (datadog-ci sourcemaps upload), TrackJS, Atatus. Automated plugins available for all major bundlers. Debugging: Error tracking services (Sentry, Datadog, Rollbar) use uploaded maps to de-minify stack traces automatically. For Node.js: node --enable-source-maps app.js (Node 12.12+). Browser DevTools: Check Sources tab for original TypeScript files. Common issue: sourceRoot mismatch - set sourceRoot: '/' in tsconfig. Critical: Maps contain your source code - treat as sensitive, upload only to trusted services.
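
A sketch of the webpack side, with the CI upload step as a comment (release name and paths are placeholders; check your provider's CLI docs for exact flags):

    // webpack.config.js
    module.exports = {
      mode: 'production',
      devtool: 'hidden-source-map', // emit .map files without a sourceMappingURL comment
      output: {
        filename: '[name].[contenthash].js',
        path: __dirname + '/dist',
      },
    };

    // CI step after the build - upload maps privately, never deploy dist/*.map:
    //   sentry-cli sourcemaps upload --release=my-app@1.2.3 ./dist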
