Production Edge Cases Advanced FAQ & Answers
20 expert answers on advanced production edge cases, researched from official documentation. Every answer cites authoritative sources you can verify.
When Node.js receives SIGTERM with no handler registered, it exits immediately - pending promises never run because the event loop stops. Solution: Use process.on('SIGTERM', gracefulShutdown), not process.once. In the shutdown handler: (1) Stop accepting new requests (server.close()), (2) Wait for in-flight requests with a timeout (Promise.race), (3) Close DB connections (await db.close()), (4) DON'T call process.exit() - let Node exit naturally once the event loop is empty. Pattern: async function gracefulShutdown(signal) { server.close(); await Promise.race([waitForInflight(), timeout(30000)]); await db.close(); /* Node exits naturally */ } - expanded in the sketch below.
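A minimal sketch of this pattern, assuming a plain Node http server and a db handle exposing close() (the db stub and in-flight counter here are illustrative, not library APIs). The Promise.race drain from the answer is written as a bounded polling loop so no stray timer keeps the event loop alive afterwards:

```js
const http = require('http');

// Stand-in for your real database client (e.g. a pg Pool).
const db = { close: async () => {} };

let inflight = 0;
const server = http.createServer((req, res) => {
  inflight++;
  res.on('finish', () => { inflight--; });
  res.end('ok'); // ... real request handling ...
});
server.listen(3000);

// Poll until all in-flight responses finish or the deadline passes.
async function drainInflight(maxMs) {
  const deadline = Date.now() + maxMs;
  while (inflight > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
}

async function gracefulShutdown(signal) {
  console.log(`Received ${signal}, shutting down`);
  server.close();               // 1. stop accepting new connections
  await drainInflight(30000);   // 2. wait for in-flight requests, 30s cap
  await db.close();             // 3. release DB connections
  // 4. no process.exit(): Node exits once the event loop is empty
}

process.on('SIGTERM', gracefulShutdown);
process.on('SIGINT', gracefulShutdown);
```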
When all connections are exhausted, new requests BLOCK in the pool's wait queue until the acquisition timeout (default 15s), then fail with a synthetic timeout error. There is no automatic recovery. Solutions: (1) Set connectionTimeoutMillis: 5000 to fail fast, (2) Implement a circuit breaker that checks pool saturation (pool.totalCount - pool.idleCount) before issuing queries, (3) Use request queuing with PQueue (concurrency < pool size), (4) Graceful degradation: serve from cache when the circuit opens. Advanced: Monitor saturation (busy/totalCount); if it exceeds 80%, open the circuit and fall back to cached data (sketch below). Prevention: Use PgBouncer transaction pooling, limit app connections to 70% of max_connections, implement backpressure.
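A sketch of the fail-fast and saturation-check ideas using node-postgres; the 80% threshold and the caller-supplied cacheFallback are illustrative assumptions:

```js
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                       // keep well below Postgres max_connections
  connectionTimeoutMillis: 5000, // fail fast instead of queueing forever
  idleTimeoutMillis: 30000,
});

// Fraction of pooled clients currently checked out (busy / total).
function poolSaturation() {
  return pool.totalCount === 0 ? 0 : (pool.totalCount - pool.idleCount) / pool.totalCount;
}

async function queryWithDegradation(sql, params, cacheFallback) {
  // Crude circuit: above ~80% saturation with callers already waiting,
  // skip the DB and serve cached data instead.
  if (poolSaturation() > 0.8 && pool.waitingCount > 0) {
    return cacheFallback();
  }
  try {
    const { rows } = await pool.query(sql, params);
    return rows;
  } catch (err) {
    if (/timeout/i.test(err.message)) return cacheFallback(); // acquire timed out
    throw err;
  }
}
```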
Multiple requests racing to /refresh can log the user out when a stale token overwrites client storage. Solutions: (1) Cookie Expiry Delta: Subtract 2 min from the cookie expiry so the browser stops sending the cookie before the server considers it expired (simplest), (2) Client-Side Locking: Use a shared promise - if a refresh is in progress, wait for it instead of starting a new one (most reliable; sketch below), (3) Server-Side Redis Cache: Cache the refresh result with a 1s TTL so racing requests get the same tokens, (4) Refresh Token Rotation with Reuse Detection: Invalidate the entire token family if a previously-used token is presented (security). Best practice: Combine client locking + server cache + reuse detection. Code: this.refreshPromise ||= this.refreshToken().finally(() => this.refreshPromise = null).
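A sketch of the client-side locking approach; the class name, /refresh endpoint, and response shape are illustrative:

```js
// Concurrent callers share one in-flight refresh instead of racing each other.
class TokenClient {
  constructor() {
    this.refreshPromise = null;
  }

  async refreshToken() {
    const res = await fetch('/refresh', { method: 'POST', credentials: 'include' });
    if (!res.ok) throw new Error(`refresh failed: ${res.status}`);
    return res.json(); // assumed shape: { accessToken, expiresAt }
  }

  // Every 401 handler awaits the same promise; it is cleared once settled
  // so the next expiry starts a fresh refresh.
  getFreshToken() {
    this.refreshPromise ||= this.refreshToken()
      .finally(() => { this.refreshPromise = null; });
    return this.refreshPromise;
  }
}
```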
Container crashes with 'no space left on device', often leaving the container in a zombie state. Prevention: (1) Use multi-stage builds to minimize image size, (2) Set container storage limits: docker run --storage-opt size=10G, (3) Configure log rotation inside the container, (4) Monitor disk space in app code (check-disk-space npm package) and trigger cleanup at 90% usage (sketch below). Recovery: docker exec into the container and remove whatever filled the disk, typically logs or temp files.
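A sketch of in-app disk monitoring, assuming the check-disk-space package (whose default export reports free and size in bytes); cleanupOldFiles() is a hypothetical placeholder for your own cleanup logic:

```js
const checkDiskSpace = require('check-disk-space').default;

const CHECK_INTERVAL_MS = 60000;
const USAGE_THRESHOLD = 0.9; // trigger cleanup at 90% usage

async function cleanupOldFiles() {
  // e.g. delete rotated logs, temp uploads, stale cache entries
}

setInterval(async () => {
  const { free, size } = await checkDiskSpace('/');
  const usage = (size - free) / size;
  if (usage > USAGE_THRESHOLD) {
    console.warn(`Disk ${Math.round(usage * 100)}% full, running cleanup`);
    await cleanupOldFiles();
  }
}, CHECK_INTERVAL_MS);
```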
WebSocket connections close after exactly 60s idle due to the proxy_read_timeout default. Nginx closes the connection if the upstream doesn't transmit within the timeout. Solution: (1) Set proxy_read_timeout 7d; for WebSocket locations, (2) Set proxy_http_version 1.1; and the upgrade headers: proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection $connection_upgrade;, (3) Disable buffering: proxy_buffering off; proxy_cache off;, (4) Application-level keep-alive: Send WebSocket ping frames every 30s to reset the timeout. Pattern: ws.ping() in a 30s setInterval (sketch below). Debugging: error_log /var/log/nginx/error.log debug; and log upstream_response_time. Map $http_upgrade correctly: map $http_upgrade $connection_upgrade {default upgrade; '' close;}. Critical: Both the nginx config and the app-level ping are needed for reliability.
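A sketch of the application-level keep-alive, roughly following the heartbeat example in the ws package's README: ping every 30s so traffic crosses nginx before proxy_read_timeout fires, and terminate peers that never answer with a pong:

```js
const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
});

const heartbeat = setInterval(() => {
  for (const ws of wss.clients) {
    if (!ws.isAlive) { ws.terminate(); continue; } // no pong since the last ping
    ws.isAlive = false;
    ws.ping();                                     // resets nginx's read timeout
  }
}, 30000);

wss.on('close', () => clearInterval(heartbeat));
```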
JWT rejected with 'Token is not valid yet' when clocks out of sync between issuer and validator. Even +1 second skew causes failures. Industry standards: Default clock tolerance is 5 minutes (300s), recommended minimum 1 minute (60s). Solutions: (1) Set clockTolerance in JWT verification: jwt.verify(token, secret, {clockTolerance: 60}), (2) Use NTP sync on all servers: apt-get install ntp, verify with timedatectl status, (3) Monitor clock drift: Fetch time from worldtimeapi.org, alert if skew >60s. Configs: Same DC: 60s, Cross-region: 120s, Mobile clients: 300s (less reliable clocks), High-security with NTP: 30s. Prevention: Implement NTP, monitor drift actively, design for inevitable clock skew in distributed systems.
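A sketch of tolerant verification with the jsonwebtoken package (clockTolerance is in seconds); the 60s value matches the same-DC guidance above:

```js
const jwt = require('jsonwebtoken');

function verifyToken(token, secret) {
  try {
    return jwt.verify(token, secret, {
      clockTolerance: 60,    // accept up to 60s of iat/nbf/exp skew
      algorithms: ['HS256'], // always pin the expected algorithm
    });
  } catch (err) {
    if (err.name === 'NotBeforeError') {
      // nbf/iat is in the future beyond the tolerance: likely clock skew, worth alerting on
    }
    throw err;
  }
}
```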
Migration fails at 60%, leaving a partial schema; both blue and green environments break because they share the same DB. Solutions: (1) BACKWARD-COMPATIBLE migrations only, in a 4-phase approach: Add column (nullable) → Dual-write period → Make NOT NULL → Remove old column (separate deployments), (2) Idempotent migrations: Check IF NOT EXISTS before ALTER, and use transactions where possible for automatic rollback, (3) Logical replication for rollback: Set up Green→Blue replication after switchover for a 1-hour rollback window, (4) Pre/Post split: Run additive changes BEFORE deployment, destructive changes AFTER it is confirmed. Pattern: guard DDL with IF NOT EXISTS (e.g. DO $$ BEGIN IF NOT EXISTS (...) THEN ALTER TABLE ... END IF; END $$;) - see the sketch below. Recovery: Check the schema_migrations table, then manually complete the migration or keep both columns (and fix in the next deployment). Never force-remove during an incident. Key: All migrations must be backward compatible.
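A sketch of an idempotent, backward-compatible "add column" step run through node-postgres; it uses ADD COLUMN IF NOT EXISTS (PostgreSQL 9.6+) rather than the DO $$ guard, which achieves the same effect, and the table/column names are illustrative:

```js
const { Pool } = require('pg');
const pool = new Pool();

async function migrateUp() {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Phase 1: additive and nullable, so old (blue) and new (green) code both work.
    // IF NOT EXISTS makes a re-run after a partial failure a no-op.
    await client.query(
      'ALTER TABLE users ADD COLUMN IF NOT EXISTS email_verified boolean'
    );
    await client.query('COMMIT'); // failure anywhere above rolls the whole step back
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```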
OpenTelemetry can cause 30-80% performance degradation with full tracing. Java: 30% overhead even when disabled. Node.js: 80% reduction in req/sec with HTTP instrumentation. Solutions: (1) Aggressive sampling: TraceIdRatioBasedSampler(0.01) = 1% of traces, plus conditional sampling (always sample errors), (2) Disable unnecessary instrumentations: -Dotel.instrumentation.fs.enabled=false, (3) Batch spans before sending: BatchSpanProcessor with maxQueueSize: 2048, scheduledDelayMillis: 5000, (4) Async span export: Use BatchSpanProcessor, not SimpleSpanProcessor, which exports synchronously and can block the event loop, (5) Collector-level sampling: probabilistic_sampler at 1%. Benchmarks: 100% tracing = 80% degradation; 1% sampling + batching = 5% degradation. Target: <5% overhead. Use feature flags for gradual rollout.
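A configuration sketch of 1% head sampling plus batched export with the OpenTelemetry Node SDK; package and option names can vary between SDK versions, so treat this as an outline rather than a drop-in config:

```js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler, BatchSpanProcessor } =
  require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  // Sample 1% of root traces; children follow the parent's decision.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.01) }),
  // Batch spans off the hot path instead of exporting them one at a time.
  spanProcessors: [
    new BatchSpanProcessor(new OTLPTraceExporter(), {
      maxQueueSize: 2048,
      scheduledDelayMillis: 5000,
    }),
  ],
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable noisy instrumentations you don't need.
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
```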
Circuit breaker has 3 states: Closed (normal), Open (failing, blocking calls), Half-Open (testing recovery). Critical edge cases: (1) Thundering herd on recovery: When circuit transitions Half-Open→Closed, all queued requests rush in. Solution: Limit concurrent requests in Half-Open (1-3 requests), add exponential backoff with full jitter to stagger retries, (2) Cascading circuit opens: Service A circuit opens → Service B timeout → B's circuit opens. Solution: Tune timeouts hierarchically (upstream shorter than downstream), implement bulkhead isolation, (3) Circuit stuck open: Service recovered but circuit stays open. Solution: Aggressive Half-Open attempts (30s interval), adaptive thresholds using sliding window (not consecutive failures), (4) Idempotence violations: Retries on non-idempotent operations cause data corruption. Solution: Only retry safe operations (GET, idempotent POST). 2025 best practices: Use Resilience4j or Istio service mesh (Hystrix deprecated), implement SRE metrics (MTTD/MTTR), adaptive ML-based threshold tuning. Config: threshold: 5 failures, timeout: 30s, resetTimeout: 60s, maxRetries: 3, jitter: 0-1000ms.
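An illustrative (not production-grade) breaker showing the three states plus a cap on concurrent trial calls in Half-Open, which is what prevents the thundering herd described above; the thresholds mirror the config at the end of the answer:

```js
class CircuitBreaker {
  constructor(fn, { threshold = 5, resetTimeout = 60000, halfOpenMax = 2 } = {}) {
    this.fn = fn;
    this.threshold = threshold;       // consecutive failures before opening
    this.resetTimeout = resetTimeout; // how long to stay Open before probing
    this.halfOpenMax = halfOpenMax;   // concurrent trial requests allowed in Half-Open
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
    this.trialInFlight = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('circuit open'); // fail fast; caller should use a fallback
      }
      this.state = 'HALF_OPEN'; // reset window elapsed: allow a few probes through
    }

    const isTrial = this.state === 'HALF_OPEN';
    if (isTrial && this.trialInFlight >= this.halfOpenMax) {
      throw new Error('circuit half-open: trial slots busy'); // shed load while probing
    }
    if (isTrial) this.trialInFlight++;

    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.state = 'CLOSED'; // successful probe (or normal success) closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (isTrial || this.failures >= this.threshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    } finally {
      if (isTrial) this.trialInFlight--;
    }
  }
}

// Usage: const breaker = new CircuitBreaker(() => fetch('https://downstream/api'));
// breaker.call().catch(() => serveFromCache());
```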
Production builds are minified and mangled, so sourcemaps are essential for debugging. 2025 best practice: Use hidden-source-map to generate maps without exposing them publicly. Configuration: (1) TypeScript: Set sourceMap: true in tsconfig.json, (2) Webpack: devtool: 'hidden-source-map' (not 'source-map' or 'eval'), (3) Vite: build.sourcemap: 'hidden' in vite.config.js. Security: Never expose .map files to end users - upload them privately to error tracking services only. CI/CD upload: Add a build step using the provider CLI: Sentry CLI (sentry-cli sourcemaps upload), Datadog CLI (datadog-ci sourcemaps upload), TrackJS, Atatus; automated plugins are available for all major bundlers. Debugging: Error tracking services (Sentry, Datadog, Rollbar) use the uploaded maps to de-minify stack traces automatically. For Node.js: node --enable-source-maps app.js (Node 12.12+). Browser DevTools: Check the Sources tab for the original TypeScript files. Common issue: sourceRoot mismatch - set sourceRoot: '/' in tsconfig. Critical: Maps contain your source code - treat them as sensitive and upload only to trusted services.
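A minimal webpack sketch of the hidden-source-map setup: maps are written to disk but never referenced from the served bundles, so they can be uploaded privately in CI:

```js
// webpack.config.js
module.exports = {
  mode: 'production',
  devtool: 'hidden-source-map', // full .map files, but no //# sourceMappingURL comment in the bundle
  // ... entry, output, loaders ...
};
```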