docker_kubernetes 19 Q&As

Docker Kubernetes FAQ & Answers

19 expert Docker and Kubernetes answers researched from official documentation. Every answer cites authoritative sources you can verify.

A

Docker security in production requires multi-layered defense. Top practices: (1) Non-root users: create a dedicated user in the Dockerfile: RUN useradd -m appuser && chown -R appuser /app; USER appuser. Prevents privilege escalation attacks. (2) Read-only filesystem: docker run --read-only --tmpfs /tmp myapp. Forces immutable infrastructure. (3) Resource limits: docker run --memory=512m --cpus=1 myapp guards against runaway containers and DoS attacks. (4) Security scanning: integrate Trivy or Snyk in CI/CD and fail builds on HIGH/CRITICAL vulnerabilities. (5) Minimal base images: prefer distroless or Alpine (images 10-50x smaller than a full-OS base, with a correspondingly smaller attack surface). (6) No secrets in images: use Docker secrets, environment variables, or secret managers (Vault, AWS Secrets Manager). (7) Network policies: default deny, explicit allow. (8) Regular updates: automate base image updates and rebuild monthly at minimum. According to 2025 surveys: 67% update Kubernetes regularly, 53% block exposed ports, 52% enable RBAC. Multi-stage builds: separate build and runtime stages to exclude build tools from the final image. Best practice: run security audits with docker-bench-security (Docker Bench for Security) and apply the least-privilege principle.
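
A minimal sketch pulling several of these practices into one Dockerfile (the Python base image, user name, and file layout are illustrative assumptions, not requirements):

  # Pinned minimal base image: small attack surface, no build toolchain
  FROM python:3.12-alpine

  # Create an unprivileged user/group instead of running as root
  RUN addgroup -S app && adduser -S -G app appuser

  WORKDIR /app
  COPY --chown=appuser:app requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  COPY --chown=appuser:app . .

  # Drop privileges before the process starts
  USER appuser
  CMD ["python", "main.py"]

At runtime the remaining controls are flags on docker run, for example: docker run --read-only --tmpfs /tmp --memory=512m --cpus=1 --cap-drop=ALL myapp.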

99% confidence
A

Multi-stage builds separate build and runtime environments, reducing final image size by 60-90% and eliminating build tools from production. Basic pattern: FROM node:20 AS builder; WORKDIR /app; COPY package*.json ./; RUN npm ci; COPY . .; RUN npm run build; FROM node:20-alpine; COPY --from=builder /app/dist ./dist; CMD ["node", "dist/main.js"]. The builder stage includes dev dependencies; the final stage contains only the runtime. Advanced patterns: (1) Multiple builders: separate stages for different compilation steps (TypeScript, CSS, assets), (2) Parallel builds: COPY --from=builder1 and COPY --from=builder2 in the final stage, (3) Build cache optimization: RUN --mount=type=cache,target=/root/.npm npm ci uses BuildKit cache mounts (5-10x faster rebuilds), (4) Secret mounting: RUN --mount=type=secret,id=npmrc npm ci passes secrets without storing them in layers. Benefits: reduced attack surface (no compilers or dev tools), faster deployments (smaller images), layer caching optimizes rebuilds. Example: a Next.js app goes from 1.2GB (single stage) to 180MB (multi-stage). Security: the build stage can use privileged operations while the runtime stage stays minimal. Best practice: use official slim/alpine variants for the final stage and order COPY commands by change frequency (package files before source). BuildKit is the default builder in Docker Engine 23.0+ (2023).
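
Written out as a complete Dockerfile (assuming a typical Node.js project whose npm run build bundles what it needs into dist/), the pattern above looks roughly like this:

  # syntax=docker/dockerfile:1
  FROM node:20 AS builder
  WORKDIR /app
  COPY package*.json ./
  # BuildKit cache mount: the npm cache speeds up rebuilds but never lands in a layer
  RUN --mount=type=cache,target=/root/.npm npm ci
  COPY . .
  RUN npm run build

  FROM node:20-alpine
  WORKDIR /app
  # Only the build output is copied; compilers and dev dependencies stay in the builder stage
  COPY --from=builder /app/dist ./dist
  CMD ["node", "dist/main.js"]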

99% confidence
A

Deployments manage stateless applications with interchangeable pods; StatefulSets manage stateful applications requiring stable identity and storage. Key differences: (1) Pod identity: Deployments use random names (app-xyz), StatefulSets use ordered names (app-0, app-1, app-2) that persist across restarts, (2) Storage: Deployments share storage or use ephemeral volumes, StatefulSets create a PersistentVolumeClaim per pod with stable binding, (3) Scaling: Deployments add and remove pods in no guaranteed order, StatefulSets scale sequentially (0→1→2, terminate 2→1→0), (4) Updates: Deployments do rolling updates in parallel, StatefulSets update one pod at a time maintaining order. Use Deployments for: web servers, APIs, stateless microservices, workers processing from queues - anything where pods are identical and replaceable. Use StatefulSets for: databases (PostgreSQL, MySQL, MongoDB), message queues (Kafka, RabbitMQ), distributed systems requiring member coordination (Elasticsearch, ZooKeeper, etcd) - anything needing stable network identity or persistent state. Example: Deployment manifest: replicas: 3; strategy: RollingUpdate; maxSurge: 1. StatefulSet manifest: replicas: 3; serviceName: my-db; volumeClaimTemplates: [...]. Performance: StatefulSets have slower startup/scaling due to sequential operations. Best practice: default to Deployments; use StatefulSets only when truly needed (they add complexity).
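
A minimal StatefulSet sketch showing the stable identity and per-pod storage described above (the name, PostgreSQL image, and storage size are illustrative):

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: my-db
  spec:
    serviceName: my-db        # headless Service gives each pod stable DNS (my-db-0.my-db, ...)
    replicas: 3
    selector:
      matchLabels:
        app: my-db
    template:
      metadata:
        labels:
          app: my-db
      spec:
        containers:
        - name: postgres
          image: postgres:16
          ports:
          - containerPort: 5432
          volumeMounts:
          - name: data
            mountPath: /var/lib/postgresql/data
    volumeClaimTemplates:     # one PersistentVolumeClaim per pod, re-bound across restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi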

99% confidence
A

HPA automatically scales pod replicas based on observed metrics, essential for handling variable load in production environments. Mechanism: the HPA controller queries metrics-server every 15s (default, configurable via --horizontal-pod-autoscaler-sync-period), calculates desired replicas using the formula desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue)), then updates the Deployment/StatefulSet spec. The autoscaling/v2 API (stable since Kubernetes 1.23) supports multiple metric sources. Built-in resource metrics: CPU utilization via kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10 scales when average CPU exceeds 70%; memory-based scaling likewise uses metrics-server data. Advanced custom metrics require external adapters: Prometheus Adapter (most popular), Datadog Cluster Agent, or KEDA (Kubernetes Event-Driven Autoscaling).

Production metric strategies: (1) HTTP request rate - scale API pods when requests/sec exceeds a threshold (typical: 100-500 req/sec per pod), (2) Queue depth - scale workers based on RabbitMQ queue length or Kafka consumer lag (scale at 1000+ pending messages), (3) Business metrics - active WebSocket connections, concurrent database queries, order processing rate. Example multi-metric HPA: combine a 70% CPU utilization target with a custom http_requests metric at 1000 req/sec average; HPA chooses the metric requiring the most pods.

Best practices: (1) Set minReplicas at 2+ for high availability (survives node failure), (2) CPU target 70-80% leaves headroom for traffic spikes (avoid 90%+), (3) Set maxReplicas based on cluster capacity and cost limits, (4) Use VPA (Vertical Pod Autoscaler) for right-sizing resource requests and HPA for horizontal scaling, (5) Configure the behavior field in the v2 API: behavior: {scaleDown: {stabilizationWindowSeconds: 300, policies: [{type: Percent, value: 50, periodSeconds: 60}]}} prevents thrashing by limiting scale-down to 50% per minute with 5-minute stabilization.

Performance impact: HPA adds ~10ms scheduling latency; metrics-server consumes ~100MB memory per 1000 pods. Common pitfalls: (1) Slow-starting applications (JVM, ML models) without proper readiness probes cause HPA thrashing - set initialDelaySeconds appropriately, (2) Missing cooldown periods lead to rapid scale up/down cycles (set scaleDown stabilizationWindowSeconds: 300), (3) Using only CPU metrics misses application-specific bottlenecks (combine with custom metrics), (4) Insufficient cluster capacity blocks scale-up (use Cluster Autoscaler). Real-world example: an e-commerce site scales from 3 pods (night) to 20 pods (peak shopping hours) based on request rate + CPU, saving 60% infrastructure cost vs a static 20-pod deployment.
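
A sketch of the autoscaling/v2 manifest described above, combining a CPU target with the scale-down behavior settings (the Deployment name and replica bounds are illustrative):

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: my-app
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300   # wait 5 minutes before acting on lower metrics
        policies:
        - type: Percent
          value: 50
          periodSeconds: 60               # remove at most 50% of current pods per minute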

99% confidence
A

Resource requests/limits control pod resource allocation and Quality of Service (QoS) class. Requests: guaranteed minimum resources, used for scheduling decisions - node must have available resources ≥ requests. Limits: maximum resources pod can consume, enforced by kernel cgroups. Configuration: resources: {requests: {cpu: '250m', memory: '256Mi'}, limits: {cpu: '500m', memory: '512Mi'}}. QoS classes (determines eviction priority): (1) Guaranteed: requests == limits for all containers, highest priority, last to be evicted, (2) Burstable: requests < limits or only requests set, medium priority, (3) BestEffort: no requests/limits, lowest priority, first evicted. Scheduling: kube-scheduler sums all pod requests per node, schedules on nodes with available capacity. Over-commitment: node can run pods with total limits > node capacity (relies on pods not hitting limits simultaneously). CPU throttling: pod hitting CPU limit gets throttled (performance degradation), memory limit causes OOMKill (pod restart). Best practices: (1) Set requests based on average usage, limits at 2x requests for burst headroom, (2) Monitor actual usage with metrics-server or Prometheus, adjust over time, (3) Use LimitRanges to enforce namespace defaults, prevent unbounded pods, (4) For critical pods: requests == limits (Guaranteed QoS). Common mistakes: no requests (can't schedule properly), limits too low (OOMKills), no limits (one pod can starve others). Use Vertical Pod Autoscaler to recommend optimal values.
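
A minimal pod sketch using the requests/limits values from the answer (the image name is illustrative); because requests < limits it lands in the Burstable QoS class:

  apiVersion: v1
  kind: Pod
  metadata:
    name: api
  spec:
    containers:
    - name: api
      image: example/api:1.0      # illustrative image
      resources:
        requests:
          cpu: "250m"             # scheduler reserves this much on the node
          memory: "256Mi"
        limits:
          cpu: "500m"             # throttled above this
          memory: "512Mi"         # OOMKilled above this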

99% confidence
A

Service mesh provides observability, security, and traffic management for microservices without modifying application code. Istio 1.24+ (ambient mode reached GA in 2024) offers ambient mode as an alternative to sidecars, reducing resource overhead by 40-50%. Traditional sidecar mode: Istio injects an Envoy proxy into each pod via a mutating webhook; proxies intercept all inbound/outbound traffic via iptables rules. Ambient mode: a Layer 4 ztunnel (zero-trust tunnel) runs per node instead of per pod, with opt-in Layer 7 waypoint proxies for advanced features.

Core capabilities: (1) Traffic management - intelligent routing for A/B testing (10% traffic to v2), canary deployments (gradual 10%→50%→100%), circuit breaking (max connections, pending requests), retries (exponential backoff), timeouts - all configured via VirtualService/DestinationRule CRDs. (2) Security - automatic mutual TLS encryption between all services (STRICT/PERMISSIVE modes), certificate rotation every 24h (default), L7 authorization policies based on JWT claims, source identity, HTTP methods. (3) Observability - distributed tracing integration (Jaeger, Zipkin) with 1% default sampling rate (configurable), RED metrics (Rate, Errors, Duration) exported to Prometheus, detailed access logs with request/response metadata.

Implementation: install the control plane via istioctl install --set profile=ambient for ambient mode or --set profile=default for sidecar mode, enable injection with the namespace label istio-injection=enabled, then deploy applications (sidecars are auto-injected). Traffic routing example for canary: a VirtualService routes 90% to the stable subset and 10% to the canary subset based on weights; a DestinationRule defines the subsets by pod labels.

Production benefits: (1) Zero-touch mTLS across 100+ microservices (impractical to manage manually), (2) Unified observability (service graph, latency percentiles, error rates), (3) Traffic shifting without redeploying apps (faster iteration), (4) Standardized resilience patterns (retries, timeouts, circuit breakers). Trade-offs: sidecar mode adds 50-100MB memory per pod plus 0.05-0.1 vCPU overhead; ambient mode reduces this to 10-20MB per pod via the shared node proxy; P50 latency impact is 0.5-1ms and P99 impact 1-2ms (acceptable for most services). Resource costs: a 100-pod cluster with sidecars requires +5-10GB memory; ambient mode requires +1-2GB. Alternatives comparison: Linkerd (Rust-based, ~20MB memory per proxy, simpler but fewer features), Consul Connect (HashiCorp ecosystem integration), AWS App Mesh (managed, AWS-only).

Best practices: (1) Start with ambient mode for new deployments (lower resource cost), (2) Enable strict mTLS only after a PERMISSIVE-mode validation period, (3) Set conservative retry/timeout defaults (max 3 retries, 15s timeout), (4) Use the PeerAuthentication CRD to enforce mTLS policy. Common pitfalls: (1) Debugging connection failures requires understanding Envoy config (use istioctl proxy-config commands), (2) Misconfigured DestinationRule subsets cause 503 errors, (3) VirtualService match order matters (first match wins), (4) Resource limits set too low cause Envoy OOMKills. Use cases: Istio is worthwhile for 20+ microservices needing unified security/observability; it is overkill for fewer than 10 services (use a simpler ingress controller). 2025 adoption data: 35% of large enterprises use a service mesh (up from 30% in 2024), with ambient mode driving renewed interest due to lower cost. Financial services and healthcare lead adoption due to compliance requirements (audit trails, mTLS).

Real-world impact: a company with 80 microservices implemented Istio ambient mode, gained complete service-graph visibility plus mTLS encryption with 2GB memory overhead vs 8GB for the sidecar approach, and reduced security incidents 60% via authorization policies.
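
A sketch of the canary routing described above, as DestinationRule subsets plus a weighted VirtualService (the service name and labels are illustrative; the networking.istio.io/v1 API applies to recent Istio releases, older ones use v1beta1):

  apiVersion: networking.istio.io/v1
  kind: DestinationRule
  metadata:
    name: myapp
  spec:
    host: myapp
    subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
  ---
  apiVersion: networking.istio.io/v1
  kind: VirtualService
  metadata:
    name: myapp
  spec:
    hosts:
    - myapp
    http:
    - route:
      - destination:
          host: myapp
          subset: stable
        weight: 90                # 90% of traffic stays on v1
      - destination:
          host: myapp
          subset: canary
        weight: 10                # 10% goes to the v2 canary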

99% confidence
A

GitOps uses Git as the single source of truth for declarative infrastructure and applications, enabling automated deployments via pull-based reconciliation. Principle: the Git repo contains all Kubernetes manifests; operators (ArgoCD, Flux) continuously sync cluster state to match the repo. Workflow: (1) Developers push changes to Git, (2) ArgoCD/Flux detects changes, (3) Applies manifests to the cluster, (4) Reconciles differences (self-healing). Implementation with ArgoCD: install ArgoCD: kubectl apply -n argocd -f install.yaml, create an Application: apiVersion: argoproj.io/v1alpha1; kind: Application; spec: {source: {repoURL: 'github.com/org/app', path: 'k8s', targetRevision: HEAD}, destination: {server: 'https://kubernetes.default.svc', namespace: default}, syncPolicy: {automated: {prune: true, selfHeal: true}}}. Benefits: (1) Audit trail: all changes in Git history, (2) Rollback: git revert reverts cluster state, (3) Disaster recovery: restore the cluster from Git, (4) Consistency: prevents kubectl drift, (5) Multi-cluster: manage 100+ clusters from a single repo. Patterns: (1) Environment branches: dev/staging/prod branches, (2) App-of-apps: an ArgoCD app that creates other apps (manages the entire platform), (3) Helm/Kustomize integration: ArgoCD renders templates before applying. Security: restrict cluster access, all changes via Git (PR review), role-based access to repos. Observability: the ArgoCD UI shows sync status, health, and history. Trade-offs: learning curve; pull-based sync means a ~30s-3min delay. Best practice: separate the app code repo from the GitOps config repo, use automated image updaters to trigger config updates on new images. 2025 adoption: over 65% of enterprise organizations implement GitOps practices.
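
The Application manifest from the answer, expanded into YAML (repository URL and namespaces are illustrative):

  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: my-app
    namespace: argocd
  spec:
    project: default
    source:
      repoURL: https://github.com/org/app   # illustrative repository
      path: k8s
      targetRevision: HEAD
    destination:
      server: https://kubernetes.default.svc
      namespace: default
    syncPolicy:
      automated:
        prune: true       # delete resources that were removed from Git
        selfHeal: true    # revert manual changes back to the Git state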

99% confidence
A

Kubernetes probes detect unhealthy pods and control traffic routing, critical for zero-downtime deployments and self-healing. Three probe types: (1) Liveness: detects if pod is alive, kubelet kills and restarts on failure - use to recover from deadlocks/hangs, (2) Readiness: detects if pod can serve traffic, removes from Service endpoints on failure - use during startup or temporary unavailability, (3) Startup: allows slow-starting pods extended time before liveness kicks in - prevents premature kills. Configuration: livenessProbe: {httpGet: {path: /healthz, port: 8080}, initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3}. Probe methods: httpGet (HTTP 200-399 = success), tcpSocket (TCP connection succeeds), exec (command exit 0). Best practices: (1) Liveness checks lightweight: avoid checking dependencies (DB, external APIs) or expensive operations - only check if process responds, (2) Readiness checks dependencies: DB connection, downstream services - determines if ready to serve, (3) Different endpoints: /livez for liveness (always passes unless deadlock), /readyz for readiness (checks dependencies), (4) Tune thresholds: initialDelaySeconds covers startup time, failureThreshold allows transient failures (3-5 retries over 30-50s). Common mistakes: (1) Same endpoint for liveness/readiness causes cascading failures (DB down → liveness fails → all pods restart), (2) No startup probe for slow apps (liveness kills during startup), (3) Too aggressive timeouts (false positives), (4) Expensive checks (slow probe execution). Example: Java app startup probe: initialDelaySeconds: 0, periodSeconds: 5, failureThreshold: 30 gives 150s startup window, then liveness takes over. Critical for: rolling updates (readiness ensures old pods drain before termination), autoscaling (only scales healthy pods), service mesh (observability).
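
A pod sketch wiring the three probe types together as described above (the image, port, and endpoint paths are illustrative assumptions):

  apiVersion: v1
  kind: Pod
  metadata:
    name: api
  spec:
    containers:
    - name: api
      image: example/api:1.0
      ports:
      - containerPort: 8080
      startupProbe:                 # 30 x 5s = up to 150s for slow starts before liveness applies
        httpGet:
          path: /livez
          port: 8080
        periodSeconds: 5
        failureThreshold: 30
      livenessProbe:                # lightweight: only "is the process responsive?"
        httpGet:
          path: /livez
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:               # checks dependencies before the pod receives traffic
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3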

99% confidence
A

NetworkPolicies provide Layer 3/4 firewall rules for pod-to-pod communication, critical for implementing zero-trust security in Kubernetes clusters. Default Kubernetes behavior: all pods can communicate with all pods across all namespaces (flat network); NetworkPolicies enable an explicit allow-list model. Enforcement requires CNI plugin support: Calico 3.27+ (most popular, 40% market share), Cilium 1.15+ (eBPF-based, fastest performance), Weave Net, or Antrea - note that the default kubenet and Flannel CNI plugins do NOT enforce NetworkPolicies.

Basic deny-all baseline policy: create a NetworkPolicy with an empty podSelector (matches all pods in the namespace) and empty ingress/egress arrays, effectively blocking all traffic. Production policy structure: specify podSelector (which pods the policy applies to), a policyTypes array ([Ingress, Egress] or a subset), ingress rules (allowed inbound traffic sources + ports), and egress rules (allowed outbound destinations + ports). Example three-tier architecture: the frontend policy allows ingress from the nginx-ingress namespace on port 3000 and egress to the backend on port 8080; the backend policy allows ingress from the frontend on port 8080 and egress to the database namespace on port 5432; the database policy allows ingress only from the backend on port 5432 and no egress (deny all outbound).

Zero-trust implementation steps: (1) Audit mode - deploy policies in Cilium with audit mode enabled or use Calico's policy recommendation tooling to observe actual traffic patterns for 7-14 days without enforcement, (2) Default deny - apply a deny-all NetworkPolicy to each namespace as the baseline, (3) Explicit allows - create a granular NetworkPolicy per service allowing only documented traffic flows, (4) Namespace isolation - use namespaceSelector with labels (e.g., env: production, team: payments) to restrict cross-namespace traffic, (5) External egress control - use ipBlock rules to allow specific external IPs (APIs, databases) while blocking general internet or internal RFC1918 ranges.

Advanced patterns: (1) DNS egress - allow pods to reach kube-dns/CoreDNS on port 53 UDP (required for service discovery), (2) Metrics scraping - allow Prometheus to scrape metrics endpoints across namespaces via a podSelector + namespaceSelector combination, (3) Service mesh integration - NetworkPolicies work alongside Istio/Linkerd providing defense-in-depth (NetworkPolicy enforces L3/4, the service mesh enforces L7), (4) FQDN-based policies - Cilium supports DNS-aware policies allowing rules like toFQDNs: [{matchName: api.stripe.com}] instead of IP ranges.

Production best practices: (1) Start with monitoring/audit mode (Cilium) and analyze actual traffic before enforcing, (2) Apply default-deny incrementally namespace-by-namespace (not cluster-wide), (3) Label pods consistently (app, version, tier labels) for maintainable selectors, (4) Document allowed flows in Git alongside the policy YAML, (5) Use namespace labels for environment isolation (dev/staging/prod), (6) Test policy changes in staging with traffic patterns matching production. Performance metrics: Cilium eBPF achieves <5µs latency overhead per packet; Calico's iptables-based enforcement adds ~10-20µs; both are negligible for application performance. Scale limits: tested up to 10,000 pods with 1000+ NetworkPolicies without degradation.

Common pitfalls: (1) Forgetting DNS egress rules breaks service discovery (pods can't resolve service names), (2) Overly broad selectors (matchLabels: {}) accidentally allow too much traffic, (3) Overly restrictive policies can interfere with health checks or control-plane traffic on some CNI implementations, (4) Testing in dev with fewer pods misses production edge cases, (5) Applying default-deny without explicit allows breaks existing services. Debugging tools: kubectl describe networkpolicy shows policy details, kubectl exec -it pod -- curl -v target-service tests connectivity, Cilium Hubble provides flow visualization, and Calico's calicoctl supports packet capture.

Compliance benefits: NetworkPolicies support PCI-DSS requirement 1.2 (network segmentation), HIPAA 164.312 (access controls), and SOC 2 CC6.6 (logical access), and provide an audit trail of allowed/denied flows for compliance reporting. Real-world impact: a financial services company implemented zero-trust NetworkPolicies across a 200-pod production cluster, blocked 15 security incidents (lateral movement attempts) in the first 6 months, and reduced the blast radius of compromised pods from cluster-wide to a single service. 2025 adoption: 65% of enterprises use NetworkPolicies in production (up from 50% in 2024), driven by compliance requirements and high-profile breach prevention.
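
A sketch of the backend tier policy from the three-tier example, including the DNS egress rule that is easy to forget (labels and namespace are illustrative):

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: backend-policy
    namespace: production
  spec:
    podSelector:
      matchLabels:
        tier: backend
    policyTypes:
    - Ingress
    - Egress
    ingress:
    - from:
      - podSelector:
          matchLabels:
            tier: frontend
      ports:
      - protocol: TCP
        port: 8080
    egress:
    - to:
      - podSelector:
          matchLabels:
            tier: database
      ports:
      - protocol: TCP
        port: 5432
    - to:                           # allow DNS lookups to cluster DNS in any namespace
      - namespaceSelector: {}
        podSelector:
          matchLabels:
            k8s-app: kube-dns
      ports:
      - protocol: UDP
        port: 53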

99% confidence
A

Blue-green and canary enable zero-downtime deployments with quick rollback capability. Blue-green: run two identical environments (blue=current, green=new), switch traffic atomically. Implementation: (1) Two Deployments: blue-deployment (replicas: 3, version: v1) and green-deployment (replicas: 3, version: v2), (2) Service selector switches: kubectl patch service myapp -p '{"spec":{"selector":{"version":"v2"}}}' switches from blue to green instantly, (3) Rollback: patch selector back to v1. Benefits: instant cutover, easy rollback, safe testing (green runs alongside blue). Drawbacks: double resource cost during deployment, all-or-nothing switch. Canary: gradually shift traffic to new version while monitoring metrics. Implementation methods: (1) Manual replica adjustment: start with 1 new pod (10% traffic), gradually increase to 10 pods (100%), (2) Ingress-based: nginx-ingress canary annotations: nginx.ingress.kubernetes.io/canary: 'true'; nginx.ingress.kubernetes.io/canary-weight: '10' sends 10% to canary service, (3) Service mesh (Istio): VirtualService with traffic split: route: [{destination: {host: myapp, subset: v1}, weight: 90}, {destination: {host: myapp, subset: v2}, weight: 10}]. Progressive delivery: automate canary analysis (monitor error rates, latency), auto-promote or rollback based on metrics (Flagger tool). Best practices: (1) Start with 5-10% canary traffic, (2) Monitor key metrics (error rate, latency P95, success rate), (3) Gradual increase: 10% → 25% → 50% → 100% over 30-60 minutes, (4) Automated rollback on metric threshold breach. Use blue-green for: low-traffic apps, database migrations (switch atomically). Use canary for: high-traffic apps, gradual validation, automated analysis. 2025 trend: canary with automated analysis (Flagger, Argo Rollouts) becoming standard for production deployments.
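
A sketch of the ingress-based canary approach using the nginx-ingress annotations mentioned above; this Ingress sits alongside the existing primary Ingress for the same host (the hostname and Service name are illustrative):

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: myapp-canary
    annotations:
      nginx.ingress.kubernetes.io/canary: "true"
      nginx.ingress.kubernetes.io/canary-weight: "10"   # send 10% of traffic to the canary Service
  spec:
    ingressClassName: nginx
    rules:
    - host: myapp.example.com
      http:
        paths:
        - path: /
          pathType: Prefix
          backend:
            service:
              name: myapp-v2        # Service selecting the new-version pods
              port:
                number: 80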

99% confidence
A

HPA automatically scales pod replicas based on observed metrics. Mechanism: the controller queries metrics-server every 15s, calculates desired replicas as ceil(current * (current_metric / target_metric)), and updates the Deployment/StatefulSet spec. The autoscaling/v2 API (stable since Kubernetes 1.23) supports multiple metric sources. Built-in: CPU/memory via kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10. Custom metrics (via Prometheus Adapter or KEDA): HTTP request rate (scale at 100-500 req/sec per pod), queue depth (RabbitMQ, Kafka consumer lag at 1000+ messages), business metrics (active connections, concurrent queries). A multi-metric HPA chooses the metric requiring the most pods. Best practices: (1) minReplicas 2+ (HA), (2) CPU target 70-80% (headroom for spikes), (3) behavior field limits scale-down: stabilizationWindowSeconds: 300 prevents thrashing, (4) combine with VPA for right-sizing. Common pitfalls: slow-starting apps without readiness probes, no cooldown causing rapid cycles, insufficient cluster capacity blocking scale-up. Use Cluster Autoscaler for node-level scaling.

99% confidence
A

Requests/limits control pod resource allocation and Quality of Service (QoS) class. Requests: Guaranteed minimum, used for scheduling - node must have available ≥ requests. Limits: Maximum consumption, enforced by kernel cgroups. Config: resources: {requests: {cpu: '250m', memory: '256Mi'}, limits: {cpu: '500m', memory: '512Mi'}}. QoS classes (eviction priority): (1) Guaranteed - requests == limits, highest priority, last evicted. (2) Burstable - requests < limits, medium priority. (3) BestEffort - no requests/limits, first evicted. Scheduling: kube-scheduler sums requests per node, schedules on available capacity. Over-commitment: Total limits can exceed node capacity (relies on pods not hitting limits simultaneously). CPU throttling: Pod hitting limit gets throttled. Memory limit: OOMKill (pod restart). Best practices: Set requests from average usage, limits at 2x requests for burst. Monitor with metrics-server/Prometheus, use LimitRanges for namespace defaults, Guaranteed QoS for critical pods. Common mistakes: No requests (can't schedule), limits too low (OOMKills), no limits (starvation). Use Vertical Pod Autoscaler for optimal values.

99% confidence
A

Service mesh provides observability, security, traffic management without modifying app code. Istio 1.24+ (2024 GA) offers ambient mode (40-50% lower resource overhead vs sidecars). Ambient mode: Layer 4 ztunnel per-node instead of per-pod, opt-in Layer 7 waypoint proxies. Sidecar mode: Envoy proxy injected per pod via mutating webhook. Core capabilities: (1) Traffic management - A/B testing, canary deployments (10%→50%→100%), circuit breaking, retries via VirtualService/DestinationRule CRDs. (2) Security - Automatic mTLS between services (STRICT/PERMISSIVE), certificate rotation every 24h, L7 authorization policies (JWT claims, HTTP methods). (3) Observability - Distributed tracing (Jaeger/Zipkin, 1% sampling), RED metrics (Rate/Errors/Duration) to Prometheus. Install: istioctl install --set profile=ambient (or --set profile=default for sidecar), enable namespace: istio-injection=enabled. Trade-offs: Sidecar adds 50-100MB memory + 0.05-0.1 vCPU per pod, ambient reduces to 10-20MB. P50 latency +0.5-1ms. Use cases: Essential for >20 microservices needing unified security/observability. 2025 adoption: 35% large enterprises, ambient mode driving renewal.
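
A sketch of enforcing mesh-wide mTLS with the PeerAuthentication CRD once the PERMISSIVE validation period is over (the namespace is illustrative; the security.istio.io/v1 API applies to recent Istio releases, older ones use v1beta1):

  apiVersion: security.istio.io/v1
  kind: PeerAuthentication
  metadata:
    name: default
    namespace: production
  spec:
    mtls:
      mode: STRICT      # reject plaintext connections once every workload is in the mesh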

99% confidence
A

GitOps uses Git as single source of truth for declarative infrastructure, enabling automated deployments via pull-based reconciliation. Principle: Git repo contains all Kubernetes manifests, operators (ArgoCD, Flux) continuously sync cluster state to match repo. Workflow: Developers push changes → ArgoCD/Flux detects → applies manifests → reconciles differences (self-healing). ArgoCD implementation: Install: kubectl apply -n argocd -f install.yaml. Create Application: kind: Application; spec: {source: {repoURL, path: 'k8s', targetRevision: HEAD}, destination: {server, namespace}, syncPolicy: {automated: {prune: true, selfHeal: true}}}. Benefits: (1) Audit trail in Git history, (2) Rollback via git revert, (3) Disaster recovery from Git, (4) Multi-cluster management (100+ clusters from single repo). Patterns: Environment branches (dev/staging/prod), app-of-apps (ArgoCD app creates other apps), Helm/Kustomize integration. Security: All changes via Git (PR review), RBAC on repos. Trade-offs: Learning curve, 30s-3min sync delay. Best practice: Separate app code from GitOps config repo, use automated image updaters. 2025 adoption: 65% of enterprises implement GitOps.

99% confidence
A

Kubernetes probes detect unhealthy pods and control traffic routing, critical for zero-downtime deployments. Three probe types: (1) Liveness - Detects if pod alive, kubelet restarts on failure. Use to recover from deadlocks/hangs. (2) Readiness - Detects if pod can serve traffic, removes from Service endpoints on failure. Use during startup or temporary unavailability. (3) Startup - Allows slow-starting pods extended time before liveness. Prevents premature kills. Config: livenessProbe: {httpGet: {path: /healthz, port: 8080}, initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3}. Methods: httpGet (200-399 success), tcpSocket, exec (exit 0). Best practices: (1) Liveness lightweight - avoid dependencies/expensive ops, only check if process responds. (2) Readiness checks dependencies - DB, downstream services. (3) Different endpoints: /livez (liveness), /readyz (readiness). (4) Tune thresholds: failureThreshold 3-5 allows transient failures. Common mistakes: Same endpoint for both (DB down → liveness fails → all pods restart), no startup probe for slow apps, too aggressive timeouts. Example: Java app startup probe with failureThreshold: 30, periodSeconds: 5 gives 150s startup window.

99% confidence
A

NetworkPolicies provide Layer 3/4 firewall rules for pod-to-pod communication, enabling zero-trust security. Default: All pods communicate freely. NetworkPolicies enable explicit allow-list model. Requires CNI plugin: Calico 3.27+ (40% market share), Cilium 1.15+ (eBPF, fastest), Weave, Antrea. kubenet/Flannel DON'T enforce policies. Zero-trust steps: (1) Audit mode - Deploy Cilium audit mode or Calico recommendation tool, observe traffic 7-14 days. (2) Default deny - Apply deny-all NetworkPolicy per namespace (empty podSelector + empty ingress/egress). (3) Explicit allows - Create granular policies per service. (4) Namespace isolation - Use namespaceSelector labels (env: production, team: payments). (5) External egress - ipBlock rules for specific external IPs, block general internet. Example three-tier: Frontend → backend:8080, Backend → database:5432, Database → deny all egress. Advanced: DNS egress (kube-dns port 53), Prometheus scraping, FQDN-based (Cilium: toFQDNs matchName: api.stripe.com). Best practices: Start audit mode, apply default-deny incrementally, label pods consistently, test in staging. Performance: Cilium <5µs latency, Calico ~10-20µs. Scale: 10K pods with 1K+ policies. 2025 adoption: 65% enterprises (up from 50% in 2024).
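
The default-deny baseline from step (2), as a sketch applied per namespace (the namespace name is illustrative):

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: default-deny-all
    namespace: production
  spec:
    podSelector: {}       # empty selector matches every pod in the namespace
    policyTypes:
    - Ingress
    - Egress
    # no ingress/egress rules listed: all traffic is denied until explicit allow policies are added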

99% confidence
A

GitOps uses Git as single source of truth for declarative infrastructure and applications. Core principles: (1) Declarative - all Kubernetes manifests stored in Git (YAML/JSON), (2) Versioned - Git commits provide audit trail, rollback capability (git revert reverts cluster state), (3) Pull-based - operators (ArgoCD, Flux) continuously sync cluster state to match Git repo, (4) Automated reconciliation - self-healing, detects drift and auto-corrects. Workflow: Developers push changes to Git → Operator detects changes → Applies manifests to cluster → Reconciles differences. Benefits: audit trail (all changes in Git history), disaster recovery (restore cluster from Git), consistency (prevents kubectl drift), multi-cluster management (manage 100+ clusters from single repo). 2025 adoption: 65%+ enterprise organizations use GitOps.

99% confidence
A

ArgoCD implementation: (1) Install ArgoCD: kubectl apply -n argocd -f install.yaml. (2) Create an Application resource: apiVersion: argoproj.io/v1alpha1, kind: Application, spec: {source: {repoURL: 'github.com/org/app', path: 'k8s', targetRevision: HEAD}, destination: {server: 'https://kubernetes.default.svc', namespace: default}, syncPolicy: {automated: {prune: true, selfHeal: true}}}. prune deletes resources not in Git; selfHeal auto-corrects drift. (3) ArgoCD continuously polls Git and applies changes automatically. Patterns: (1) Environment branches (dev/staging/prod), (2) App-of-apps (an ArgoCD app creates other apps and manages the platform), (3) Helm/Kustomize integration (renders templates before applying). The UI shows sync status, health, and history. Security: restrict cluster access; all changes go via Git (PR review required). Sync delay: ~30s-3min (pull-based).

99% confidence
A

Flux implementation: (1) Install Flux: flux bootstrap github --owner=org --repository=fleet --path=clusters/production --personal. Creates Git repo structure, installs Flux controllers. (2) Flux uses GitRepository CRD to watch repo, Kustomization CRD to apply manifests. (3) Continuous reconciliation: Flux polls Git every 1 minute (configurable), applies changes automatically. (4) Structure: clusters/production/ contains Kustomization files, apps/ contains application manifests. Flux features: (1) Multi-tenancy - namespace isolation per team, (2) Notification system - alerts to Slack/Teams on deployments, (3) Image automation - updates manifests when new container images pushed. Patterns: separate app code repo from GitOps config repo, use flux image automation to trigger config updates. Flux reconciles faster than ArgoCD (1 min default vs 3 min), but less UI visibility (CLI-focused).
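
A sketch of the two Flux CRDs described above, watching a repository and applying a path from it (the repository URL, branch, and paths are illustrative):

  apiVersion: source.toolkit.fluxcd.io/v1
  kind: GitRepository
  metadata:
    name: fleet
    namespace: flux-system
  spec:
    interval: 1m                          # how often Flux polls the repository
    url: https://github.com/org/fleet     # illustrative repository
    ref:
      branch: main
  ---
  apiVersion: kustomize.toolkit.fluxcd.io/v1
  kind: Kustomization
  metadata:
    name: apps
    namespace: flux-system
  spec:
    interval: 10m                         # re-apply even without new commits, correcting drift
    sourceRef:
      kind: GitRepository
      name: fleet
    path: ./apps
    prune: true                           # remove resources deleted from Git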

99% confidence