
Kubernetes Container Orchestration FAQ & Answers

40 expert Kubernetes Container Orchestration answers researched from official documentation. Every answer cites authoritative sources you can verify.


A

Pod is the smallest deployable unit: group of 1+ containers with shared storage/network and a specification for running the containers. Pods are ephemeral and are not recreated after failure unless managed by a controller. Deployment manages a set of Pods, providing declarative updates for Pods and ReplicaSets. Deployment controls replicas, rolling updates, rollbacks, scaling. Best practice: create Pods using Deployments/Jobs, not directly. For stateful apps, use StatefulSet instead.
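Illustrative manifest of that best practice (a minimal sketch; name, labels, and image are hypothetical): a Deployment whose Pod template is what actually gets scheduled, and which replaces failed Pods.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # hypothetical name
spec:
  replicas: 3               # Deployment keeps 3 Pod replicas running and replaces failed ones
  selector:
    matchLabels:
      app: web
  template:                 # Pod template: the Pod is the smallest deployable unit
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25   # example image/tag
        ports:
        - containerPort: 80
```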

99% confidence
A

StatefulSet maintains sticky identity for each Pod with persistent identifier across rescheduling. Requires Headless Service for network identity. Ordered deployment/scaling/deletion (sequential). Storage persists when StatefulSet deleted/scaled down. For stateful apps (databases, message queues). Deployment creates interchangeable stateless replicas. No stable network identity. Parallel pod creation. For stateless apps (web servers, APIs). If no stable identifiers needed, use Deployment.
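A minimal StatefulSet sketch (names and image are illustrative, following the standard nginx example; assumes a Headless Service named nginx already exists): serviceName provides the stable per-Pod DNS names, and volumeClaimTemplates gives each Pod its own PersistentVolumeClaim that survives scale-down.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx            # must reference an existing Headless Service
  replicas: 3                   # Pods created in order: web-0, web-1, web-2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25       # example image
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:         # one PVC per Pod (www-web-0, www-web-1, ...), retained on scale-down
  - metadata:
      name: www
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```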

99% confidence
A

ClusterIP (default): exposes service on cluster-internal IP, only reachable within cluster. Use for internal microservices. NodePort: exposes service on each Node's IP at static port (30000-32767 range), automatically creates ClusterIP. Use for development/testing, external access without load balancer. LoadBalancer: provisions cloud provider load balancer, automatically creates NodePort and ClusterIP. Use for production external access on cloud platforms (AWS/GCP/Azure).
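For illustration, a LoadBalancer Service (hypothetical names and ports); changing type to ClusterIP or NodePort switches between the behaviors described above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer        # also creates a NodePort and ClusterIP under the hood
  selector:
    app: web                # routes to Pods carrying this label
  ports:
  - port: 80                # Service (and cloud load balancer) port
    targetPort: 8080        # container port on the selected Pods
```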

99% confidence
A

Horizontal Pod Autoscaler (HPA, 2025): Automatically scales workload resources (Deployment, StatefulSet, ReplicaSet) by adjusting replica count to match demand - horizontal scaling adds/removes Pods in response to load changes. How HPA works (reconciliation loop): (1) Metrics collection: HPA controller queries metrics every 15 seconds (default, configurable via --horizontal-pod-autoscaler-sync-period). Metrics API (metrics.k8s.io/v1beta1) provides resource metrics (CPU/memory) from metrics-server. Custom metrics API (custom.metrics.k8s.io/v1beta1) for app-specific metrics (requests/sec, queue depth). External metrics API (external.metrics.k8s.io/v1beta1) for cloud provider metrics (AWS SQS queue length, GCP Pub/Sub backlog). (2) Calculation: Desired replicas = ceil(current replicas × (current metric / target metric)). Example: 3 replicas @ 80% CPU, target 50% → 3 × (80/50) = 4.8 → 5 replicas. Uses average across all pods (sum of pod metrics / number of pods). (3) Scaling decision: Scales up immediately when metric exceeds target. Scales down after stabilization window (5 min default, --horizontal-pod-autoscaler-downscale-stabilization-window) to prevent flapping. Respects min/max replicas (spec.minReplicas: 2, spec.maxReplicas: 10). (4) Apply change: Updates target resource's spec.replicas field via scale subresource. Metric types (2025): (1) Resource metrics (CPU/memory): Based on pod resource requests - targetAverageUtilization: 70 (70% of requested CPU), targetAverageValue: 500m (500 milliCPU absolute). Requires resource requests defined in pod spec. (2) ContainerResource metrics (Kubernetes 1.30+ stable): Scale on individual container metrics within multi-container pods - type: ContainerResource, container: sidecar-proxy, target: averageUtilization: 80. Solves sidecar problem where sidecar uses 90% CPU but app container only 30%. (3) Custom metrics (via Prometheus Adapter, Datadog, etc.): Application metrics - http_requests_per_second, active_connections, queue_depth. Example: scale based on custom metric 'requests_per_pod' with target 1000 RPS/pod. (4) External metrics (cloud provider integrations): AWS CloudWatch (SQS ApproximateNumberOfMessages), GCP Monitoring (Pub/Sub num_undelivered_messages), Azure Monitor. Decouples scaling from pod metrics - scale consumers based on producer queue length. Configuration example (2025): HPA with CPU (70% util), memory (1.5Gi avg), custom metric (requests/sec target 500), min 2 replicas, max 20 replicas. Behavior policy: scaleDown stabilizationWindowSeconds: 300, policies: periodSeconds 60, type Pods, value 2 (max 2 pods down per minute). scaleUp policies: type Percent, value 50, periodSeconds 60 (max 50% increase per minute). Production best practices (2025): (1) Always define resource requests: HPA cannot function without requests - CPU/memory requests mandatory for resource-based autoscaling. (2) Multiple metrics: Combine CPU + memory + custom (scale on whichever breaches first) - prevents CPU-bound scaling ignoring memory exhaustion. (3) Behavior policies: Control scaling velocity to prevent thrashing - slow scale-down (5 min stabilization), moderate scale-up (2x in 2 min max). (4) PodDisruptionBudget: Ensure HPA doesn't scale below minimum required for availability - PDB minAvailable: 2 prevents HPA scaling to 1 replica. (5) Load testing: Validate HPA thresholds under realistic traffic - simulate traffic spikes, measure scaling latency (metrics delay + pod startup time). 
(6) Monitor scaling events: kubectl describe hpa shows scaling history - track 'ScalingReplicaSet' events, insufficient metrics warnings. Common issues (2025): (1) Unknown metrics: HPA shows 'unable to get metrics' - verify metrics-server running (kubectl get deployment metrics-server -n kube-system), check pod resource requests defined. (2) Flapping: Rapid scale up/down cycles - increase stabilization window, add behavior policies (selectPolicy: Max, Min, Disabled). (3) CPU throttling before scaling: Pods throttled at request limit before HPA triggers - set requests lower than typical usage (request: 500m, limit: 2000m, target: 60% of request = 300m trigger). Advanced: Predictive autoscaling (2025): KEDA (Kubernetes Event-Driven Autoscaling) adds cron-based scaling (scale up before known traffic spike), external scaler plugins (Redis, Kafka, MongoDB, MySQL, RabbitMQ, AWS SQS/Kinesis). Combine HPA (reactive) + KEDA (predictive) for optimal scaling. Kubernetes 1.30 ContainerResource metrics (stable): Critical for sidecar patterns - service mesh (Istio/Linkerd sidecar uses 50% CPU, app uses 10% - scale on app container only, not sidecar noise). Logging agents (Fluent Bit sidecar CPU spikes during log bursts don't trigger unnecessary scaling). Cost optimization: HPA reduces over-provisioning by 40-60% - right-size capacity to actual demand, scale to min replicas during off-peak (2 replicas 2am-6am, 10 replicas 9am-5pm). Scaling latency: Metrics collection (15s) + decision (1s) + pod startup (30-120s depending on image size/init containers) = 45-135s total scale-up latency. Use readiness probes to prevent premature traffic routing.
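A sketch of the HPA configuration described above (autoscaling/v2; target name and thresholds are illustrative): CPU utilization target, min/max replicas, and behavior policies that slow scale-down.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                       # hypothetical Deployment; its Pods must define CPU requests
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70        # percent of the Pods' CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # 5 min window to prevent flapping
      policies:
      - type: Pods
        value: 2                      # remove at most 2 Pods per minute
        periodSeconds: 60
    scaleUp:
      policies:
      - type: Percent
        value: 50                     # add at most 50% more Pods per minute
        periodSeconds: 60
```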

99% confidence
A

ConfigMap stores non-confidential configuration data as key-value pairs separate from application code. Use for config files, environment variables, command-line arguments. Secrets store confidential data (passwords, tokens, keys). Warning: Secrets stored unencrypted in etcd by default - enable encryption at rest, RBAC, and consider external Secret providers. ConfigMaps can be immutable (prevents accidental updates). Use Secrets for sensitive data, ConfigMaps for non-sensitive config.
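Illustrative manifests (names and values are hypothetical): a ConfigMap for non-sensitive settings and a Secret for credentials; containers consume both via env/envFrom or volume mounts.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
immutable: true            # optional: blocks accidental updates after creation
data:
  LOG_LEVEL: "info"
  app.properties: |
    feature.flag=true
---
apiVersion: v1
kind: Secret
metadata:
  name: db-creds
type: Opaque
stringData:                # written as plain text, stored base64-encoded (not encrypted) in etcd
  DB_PASSWORD: change-me   # placeholder value
```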

99% confidence
A

DaemonSet (2025): Kubernetes workload controller ensuring specific pod runs on all (or subset of) nodes in cluster - as nodes added, DaemonSet pods automatically scheduled; as nodes removed, pods garbage collected. Scheduling behavior (2025): (1) Automatic node coverage: DaemonSet controller creates pod for each node matching nodeSelector (if specified) or all nodes (default). Pods scheduled by default scheduler (Kubernetes 1.12+) respecting node affinity, taints/tolerations, resource requests. (2) Node selection: nodeSelector (simple label matching, example: disktype=ssd), nodeAffinity (advanced rules with requiredDuringScheduling, preferredDuringScheduling), podAntiAffinity (avoid co-location with other pods). (3) Taint toleration: DaemonSets automatically tolerate node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/disk-pressure, node.kubernetes.io/memory-pressure, node.kubernetes.io/pid-pressure, node.kubernetes.io/network-unavailable (for network plugins), node.kubernetes.io/unschedulable (for ScheduleDaemonSetPods). Enables critical system pods to run even on tainted nodes. (4) Update strategy: RollingUpdate (default, max unavailable controlled, gradual rollout) or OnDelete (manual pod deletion triggers update). Primary use cases (2025): (1) Cluster storage daemons: Distributed storage systems requiring node-level agents - Ceph (ceph-osd for block storage), GlusterFS (glusterd for distributed filesystem), Rook (storage orchestrator), Longhorn (cloud-native block storage). Each node runs storage daemon managing local disks. (2) Log collection & aggregation: Node-level log collectors shipping logs to centralized system - Fluentd (CNCF graduated, tails /var/log/containers/*.log, ships to Elasticsearch/S3), Fluent Bit (lightweight, 450KB memory footprint vs Fluentd 40MB), Logstash (Elastic Stack), Vector (Rust-based, 10x faster than Fluentd). Accesses host filesystem via hostPath volume mount (/var/log, /var/lib/docker/containers). (3) Node monitoring & metrics: System-level metrics exporters for observability - Prometheus Node Exporter (CPU, memory, disk, network metrics from /proc, /sys), Datadog agent (APM + infrastructure monitoring), New Relic infrastructure agent, Dynatrace OneAgent, collectd (Unix daemon collecting system metrics). Exposes node metrics on well-known port (9100 for Node Exporter). (4) Security & compliance: Continuous security scanning and enforcement - Falco (runtime threat detection, syscall monitoring via eBPF), kube-bench (CIS Kubernetes Benchmark compliance checks), Twistlock Defender (container runtime protection), Aqua Security enforcer, Sysdig Secure agent. Privileged access required (hostPID: true, hostNetwork: true, securityContext.privileged: true). (5) Cluster networking: CNI plugins requiring node-level network configuration - Calico (felix daemon for policy enforcement + BGP routing), Cilium (cilium-agent for eBPF-based networking), Weave Net (weave daemon for overlay network), Flannel (flanneld for subnet allocation). Manages iptables rules, routes, network namespaces on each node. (6) Device plugins: Hardware resource management - NVIDIA GPU device plugin (exposes nvidia.com/gpu resource), Intel SR-IOV network device plugin, FPGA device plugin. Discovers and advertises specialized hardware to kubelet. 
Configuration example (2025): DaemonSet with nodeSelector (disktype: ssd), resource requests (cpu: 100m, memory: 200Mi), resource limits (cpu: 200m, memory: 400Mi), updateStrategy RollingUpdate with maxUnavailable: 1, priorityClassName: system-node-critical (prevents eviction during node pressure). Production best practices (2025): (1) Resource limits: Always set requests/limits - prevents DaemonSet pods from starving application pods (request: 50m CPU, 100Mi memory typical for lightweight agents). (2) Priority class: Use system-node-critical or system-cluster-critical - ensures DaemonSet pods not evicted during resource pressure (preempts lower-priority pods instead). (3) Health checks: Liveness probe (restart unhealthy agent) + readiness probe (don't route traffic until agent ready, relevant for network plugins). (4) Security context: Minimize privileges - avoid privileged: true if possible, use specific capabilities (CAP_NET_ADMIN for networking, CAP_SYS_ADMIN for monitoring), readOnlyRootFilesystem: true when feasible. (5) Update strategy: RollingUpdate with maxUnavailable: 1 or 10% - prevents simultaneous disruption across all nodes (critical for CNI plugins where network outage affects entire node). (6) Host namespace access: Only when required - hostNetwork: true for CNI plugins (need access to host network stack), hostPID: true for monitoring agents (need visibility into all processes). Common patterns (2025): Log collector: Fluent Bit DaemonSet with hostPath mount (/var/log:/var/log, /var/lib/docker/containers:/var/lib/docker/containers), resource limit 100m CPU / 200Mi memory, ships logs to Elasticsearch via outputs.conf ConfigMap. Monitoring: Prometheus Node Exporter with hostPath mount (/proc:/host/proc, /sys:/host/sys), hostNetwork: true (binds to node IP:9100), tolerates all taints (runs on control plane + worker nodes). Networking: Calico DaemonSet with privileged: true, hostNetwork: true, hostPID: true, manages iptables/ipvs via calico-node container. Limitations (2025): DaemonSets respect node selectors but NOT pod count limits - always 1 pod per matching node (cannot run 0 or 2+ pods per node). For multi-instance per node, use multiple DaemonSets with different nodeAffinity. Scheduling edge cases: Pods not scheduled on nodes with NoSchedule taint unless DaemonSet includes matching toleration - verify taints with kubectl describe node | grep Taints before deploying DaemonSet.
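A condensed sketch of the log-collector pattern above (namespace, image, and resource numbers are illustrative): one agent Pod per node, tolerating taints, reading host logs via hostPath.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-collector
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # update one node's agent at a time
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists             # run on tainted nodes, including control plane
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:3.0 # example image/tag
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 200m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log             # node-level log files
```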

99% confidence
A

Kubernetes Ingress (2025): API object managing external HTTP/HTTPS access to cluster services - provides load balancing, SSL/TLS termination, name-based virtual hosting, path-based routing. Alternative to creating multiple LoadBalancer services (expensive, requires separate cloud load balancer per service). Single Ingress exposes multiple services via routing rules. Key features: (1) Path-based routing: Route by URL path - /api → backend-service, /web → frontend-service, /admin → admin-service. (2) Host-based routing: Route by hostname - api.example.com → api-service, www.example.com → web-service. (3) TLS termination: HTTPS handled at ingress, backends receive HTTP - reduces cert management complexity (single cert/wildcard at ingress vs per-service certs). (4) Load balancing: Distributes traffic across service pods using round-robin, least-connections, or IP hash. Ingress Controller (required): Ingress resource is declarative config only - requires Ingress Controller (daemon watching Ingress resources, configuring reverse proxy to implement rules). Not built into Kubernetes - must deploy separately. NGINX Ingress Controller (2025): Community-maintained controller (kubernetes.github.io/ingress-nginx, CNCF project) using NGINX as reverse proxy/load balancer. Architecture: (1) Controller pod: Watches Ingress resources + Services + Endpoints via Kubernetes API, generates NGINX config from Ingress rules, reloads NGINX when config changes. (2) NGINX process: Handles HTTP/HTTPS traffic, proxies to backend service ClusterIPs, terminates TLS using Secret-stored certificates. (3) ConfigMap: Global NGINX configuration (timeouts, buffer sizes, rate limits) in ingress-nginx-controller ConfigMap. (4) Admission webhook: Validates Ingress manifests before admission (prevents misconfigurations). Deployment patterns (2025): (1) Cloud LoadBalancer: NGINX Ingress deployed as LoadBalancer Service - cloud provider provisions ELB/ALB (AWS), GLB (GCP), Load Balancer (Azure). Traffic flow: Internet → Cloud LB → NGINX Ingress pod → Service → Pods. (2) NodePort: NGINX exposed via NodePort 30000-32767 - for on-prem/bare-metal without cloud LB. (3) DaemonSet + HostNetwork: NGINX runs on every node via DaemonSet with hostNetwork: true - binds to ports 80/443 on node IP, no kube-proxy overhead. External LB (HAProxy, F5, MetalLB) fronts nodes. Configuration example (2025): Ingress with host: api.example.com, TLS secret: api-tls-cert, paths: /v1 → api-v1-service:8080, /v2 → api-v2-service:8080. Annotations: nginx.ingress.kubernetes.io/rewrite-target: /, nginx.ingress.kubernetes.io/ssl-redirect: true (force HTTPS), nginx.ingress.kubernetes.io/rate-limit: 100 (100 req/sec). Popular Ingress Controllers (2025): (1) NGINX Ingress (kubernetes.github.io/ingress-nginx): Most widely used, mature, extensive annotations (30+), supports WebSocket/gRPC, canary deployments. (2) Traefik: Auto-discovery, dynamic config, native Let's Encrypt integration, middleware plugins (rate limit, circuit breaker). (3) HAProxy Ingress: High performance (1M+ concurrent connections), advanced routing (ACLs), sticky sessions. (4) AWS ALB Ingress: Native AWS ALB integration, path-based routing in ALB, target groups per service. (5) Istio Ingress Gateway: Service mesh integration, advanced traffic management (A/B testing, traffic mirroring), mTLS. (6) Contour: VMware-backed, uses Envoy proxy, HTTPProxy CRD for advanced features. 
NGINX vs NGINX Inc versions: (1) kubernetes.github.io/ingress-nginx (community): Free, CNCF project, uses Ingress resource, supports Kubernetes 1.19+, maintained by Kubernetes SIG Network. (2) nginxinc/kubernetes-ingress (NGINX Inc): Commercial support available, uses VirtualServer/VirtualServerRoute CRDs, NGINX Plus features (JWT auth, active health checks, advanced load balancing), compatible with NGINX App Protect WAF. Production best practices (2025): (1) TLS certificates: Use cert-manager for automated Let's Encrypt certs - annotate Ingress with cert-manager.io/cluster-issuer: letsencrypt-prod, cert-manager auto-creates/renews certs. (2) Resource limits: Set NGINX pod limits (cpu: 1000m, memory: 2Gi) to prevent OOM during traffic spikes. (3) Replicas: Run 2+ NGINX Ingress replicas across zones for HA - PodDisruptionBudget minAvailable: 1. (4) Rate limiting: Protect backends with nginx.ingress.kubernetes.io/limit-rps: 10, limit-connections: 100. (5) Monitoring: Prometheus metrics endpoint (/metrics on port 10254), track request rate, error rate, latency p99. (6) Security: Enable WAF via ModSecurity (nginx.ingress.kubernetes.io/enable-modsecurity: true), restrict client IPs (nginx.ingress.kubernetes.io/whitelist-source-range: 10.0.0.0/8). Common annotations (2025): rewrite-target (URL rewriting), ssl-redirect (force HTTPS), proxy-body-size (upload limit, default 1m), backend-protocol (HTTPS/GRPC/FCGI), auth-url (external auth), canary-weight (traffic splitting for blue-green). Limitations: Ingress API limited to HTTP/HTTPS layer 7 (no TCP/UDP) - use Gateway API (successor to Ingress, graduated beta 2023) for advanced routing, or Service type LoadBalancer for non-HTTP protocols. Performance: NGINX Ingress handles 50K-100K req/sec per pod (4 CPU cores), sub-5ms latency p50. Scales horizontally via replicas. Troubleshooting: kubectl logs -n ingress-nginx deploy/ingress-nginx-controller (NGINX reload errors, backend connection failures), kubectl describe ingress (shows events, backend health), curl ingress-ip/healthz (controller health).
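A sketch of the configuration example mentioned above (host, Secret, and Service names are illustrative), written for the community ingress-nginx controller:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"   # force HTTPS
spec:
  ingressClassName: nginx
  tls:
  - hosts: [api.example.com]
    secretName: api-tls-cert        # TLS certificate stored as a Secret
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-v1-service
            port:
              number: 8080
      - path: /v2
        pathType: Prefix
        backend:
          service:
            name: api-v2-service
            port:
              number: 8080
```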

99% confidence
A

Role: grants permissions within single namespace. ClusterRole: grants permissions cluster-wide or across namespaces. RoleBinding: grants permissions defined in Role to users/groups/service accounts within namespace. ClusterRoleBinding: grants permissions cluster-wide. RBAC uses rbac.authorization.k8s.io API group, allows dynamic policy configuration via Kubernetes API. Key security control ensuring cluster users/workloads have only required access. Best practice: principle of least privilege.
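A least-privilege sketch (namespace, names, and subject are hypothetical): a namespaced Role allowing read-only Pod access, bound to a ServiceAccount.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
- apiGroups: [""]                    # "" = core API group
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
- kind: ServiceAccount
  name: ci-bot                       # hypothetical subject
  namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```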

99% confidence
A

PersistentVolume (PV): piece of storage in cluster provisioned by admin or dynamically via Storage Classes. Cluster resource independent of pods. PersistentVolumeClaim (PVC): request for storage by user. Specifies size and access modes (ReadWriteOnce, ReadOnlyMany, ReadWriteMany, ReadWriteOncePod). PV is cluster resource, PVC is request for and claim on that resource. Abstracts how storage provided from how it's consumed.
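Illustrative claim (size, class, and name are hypothetical): the PVC requests storage, a matching PV is bound (or dynamically provisioned), and a Pod mounts it by claimName.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd    # assumes a StorageClass with this name exists for dynamic provisioning
  resources:
    requests:
      storage: 10Gi
```

A Pod then consumes it with a volumes entry of persistentVolumeClaim.claimName: data.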

99% confidence
A

Kubernetes 1.30 (Uwubernetes, April 2024): Major release with 58 enhancements focusing on security isolation, autoscaling precision, and authorization flexibility. Top features (2025 production impact): (1) User Namespaces support (Beta): Isolates container users from host users by mapping container UIDs/GIDs to different host ranges (container root UID 0 maps to host UID 100000-165535). Enabled with pod spec: hostUsers: false. Supports pods with/without volumes, custom UID/GID ranges via RunAsUser/RunAsGroup. Security benefit: prevents container escape attacks (CVE-2022-0185 style vulnerabilities) - even if attacker gains root in container, they have no privileges on host. Requirements: Linux kernel 5.8+, systemd 247+, containerd 1.7+/CRI-O 1.25+. Limitations: doesn't work with hostNetwork, hostPID, hostIPC (conflicts with namespace isolation). Production use case: multi-tenant clusters running untrusted workloads (SaaS platforms, CI/CD runners) - isolates tenant containers from each other and host. (2) Structured Authorization Configuration (Beta, enabled by default): Replaces --authorization-mode flag with structured config file defining authorization chain (RBAC + Webhook + Node authorizer). Enables dynamic authorization without API server restart, supports multiple webhooks with failure policies (Deny/NoOpinion). Example: authorization chain with RBAC (cluster-local permissions) followed by external OPA webhook (policy as code), fallback to Node authorizer (kubelet access). Configuration: apiVersion: apiserver.config.k8s.io/v1beta1, kind: AuthorizationConfiguration, authorizers: type RBAC, type Webhook (connectionInfo url, failurePolicy). Production benefit: centralized policy management (audit all authorization decisions), gradual rollout of new policies (Webhook in audit mode before enforcement). (3) HPA ContainerResource metrics (Stable, GA): Autoscale based on individual container resource usage instead of pod aggregate (solves sidecar noise problem). Example: multi-container pod with app (30% CPU) + istio-proxy sidecar (70% CPU) - previous HPA scales on pod total (50% avg), new HPA scales on app container only (30%, preventing over-scaling). Configuration: HPA v2 spec.metrics type: ContainerResource, containerResource: name: cpu, container: app, target.averageUtilization: 70. Critical for service mesh deployments where sidecar CPU usage doesn't correlate with app load. Graduated from alpha (1.20) → beta (1.27) → stable (1.30). Production impact: reduces 20-40% over-provisioning in service mesh environments (sidecar CPU spikes during TLS handshakes don't trigger unnecessary scaling). Additional Kubernetes 1.30 features: (4) ReadWriteOncePod PV access mode (GA): Ensures volume mounted by single pod only across entire cluster (stricter than ReadWriteOnce which allows multiple pods on same node). Use case: databases requiring exclusive storage access (PostgreSQL, MySQL with local SSD). (5) AppArmor support (GA): Native AppArmor profile enforcement via pod security context (container.appArmorProfile: type RuntimeDefault/Localhost/Unconfined, localhostProfile: my-profile). Previously required annotations (deprecated in 1.30). Mandatory Access Control (MAC) for container processes (restricts file access, network operations, capabilities). (6) MinDomains in PodTopologySpread (Beta): Ensures minimum number of topology domains receive pods (prevents zone imbalance during node failures). Example: 3-zone cluster with minDomains: 3 ensures pods in all zones even if one zone has fewer nodes. 
(7) CEL for Admission Control (Stable): ValidatingAdmissionPolicy using Common Expression Language (CEL) for in-process validation without external webhooks (10x faster than webhook-based validation, <1ms vs 10-50ms). Example: enforce resource limits policy - object.spec.containers.all(c, has(c.resources.limits)). (8) Sleep action for Container Lifecycle hooks (Alpha): preStop hook with sleep: seconds: 30 (delays SIGTERM, drains connections gracefully). Replaces custom shell script sleep hacks. (9) Contextual logging (Alpha): Structured logging with request context propagation (trace IDs through controller chains). Version compatibility (2025): Kubernetes 1.30 supports: (1) etcd 3.5.x (required, 3.4.x deprecated). (2) CoreDNS 1.11.x (bundled). (3) CRI-O 1.30.x, containerd 1.7.x (container runtimes). (4) Go 1.22.x (for custom controllers). (5) kubectl 1.30 compatible with API server versions 1.29, 1.30, 1.31 (skew policy: ±1 minor version). Deprecations and removals: (1) v1beta2 FlowSchema/PriorityLevelConfiguration (removed): Migrated to flowcontrol.apiserver.k8s.io/v1 (APF v1 stable). (2) status.nodeInfo.kubeProxyVersion field (deprecated): No longer updated (kube-proxy version not always accurate in custom CNI setups). (3) CSI Migration for AWS EBS, GCE PD, Azure Disk (completed): In-tree cloud volume plugins removed, CSI drivers mandatory (ebs.csi.aws.com, pd.csi.storage.gke.io, disk.csi.azure.com). Upgrade considerations: (1) Test user namespaces in staging (hostUsers: false) before production (verify volume permissions, check app compatibility with remapped UIDs). (2) Migrate authorization config to structured format (prepare for --authorization-mode flag deprecation in future versions). (3) Update HPA to use ContainerResource metrics for service mesh workloads (review all HPAs with multi-container pods). (4) Switch to CSI drivers if still using in-tree cloud providers (aws-ebs, gce-pd deprecated). (5) Update custom admission webhooks to ValidatingAdmissionPolicy with CEL (performance improvement + reduced operational overhead). Performance improvements: Scheduler throughput increased 15% (better queue management), API server memory usage reduced 10% (optimized cache eviction), kubelet CPU usage reduced 8% (improved pod status sync). Production adoption timeline: Kubernetes 1.30 supported until February 2025 (12-month support window), production-ready for early adopters (June 2024+), enterprise adoption (September 2024+), cloud provider managed offerings (GKE 1.30 GA August 2024, EKS 1.30 GA September 2024, AKS 1.30 GA September 2024). Security hardening: User namespaces eliminate entire class of container escape vulnerabilities (2024 CVEs: CVE-2024-21626 runc escape, CVE-2024-3094 XZ backdoor - mitigated with UID remapping). Key takeaway: Kubernetes 1.30 prioritizes security isolation (user namespaces), precision autoscaling (container-level HPA), and operational flexibility (structured authorization) - essential upgrades for multi-tenant and service mesh environments.
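A hedged sketch of the user-namespaces feature described above (pod name and image are illustrative; assumes Kubernetes 1.30 with the beta feature enabled plus a compatible kernel and runtime as listed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  hostUsers: false              # container UIDs/GIDs remapped to an unprivileged host range
  containers:
  - name: app
    image: nginx:1.25           # example image
    securityContext:
      allowPrivilegeEscalation: false
```

As noted above, this cannot be combined with hostNetwork, hostPID, or hostIPC.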

99% confidence
A

Kubernetes 1.29 (Mandala, Dec 2023) key features: (1) QueueingHint - optimizes scheduling efficiency by reducing useless requeueing retries up to 90%, improves cluster performance at scale. (2) In-tree cloud provider removal - defaults to operating without built-in cloud provider integration (AWS, GCP, Azure moved to external CCM). (3) LegacyServiceAccountTokenCleanUp - labels legacy auto-generated secret-based service account tokens as invalid if unused for 1+ years, significantly reducing attack surface for compromised tokens. (4) ReadWriteOncePod PV access mode graduates to stable - ensures single pod exclusivity for storage. (5) Node log query via kubelet API (alpha) - query node-level logs without SSH access. (6) KMS v2 improvements - enhanced encryption at rest performance. Focus on security hardening (token cleanup, encryption), scheduling efficiency (90% retry reduction), cloud-native architecture (external CCM). 45 enhancements total. Production impact: reduced scheduler overhead, improved security posture, cloud-agnostic deployments.

99% confidence
A

Helm (2025): Package manager for Kubernetes enabling templatized, reusable, versioned application deployments. Alternative to raw kubectl apply with static YAML manifests (eliminates duplication, enables environment-specific configuration, manages application lifecycle). Core concepts: (1) Chart: Package containing Kubernetes resource templates (Deployment, Service, Ingress), values.yaml (default config), Chart.yaml (metadata: name, version, dependencies), helpers.tpl (reusable template functions). Distributed as .tgz archives via HTTP repositories (Artifact Hub, private ChartMuseum, OCI registries). (2) Release: Installed instance of chart with specific name and values (example: myapp-prod release from nginx chart with values prod-values.yaml). Single chart creates multiple releases (nginx-dev, nginx-staging, nginx-prod with different configurations). (3) Repository: Collection of packaged charts accessible via index.yaml (Artifact Hub has 10,000+ public charts: bitnami/postgresql, prometheus-community/kube-prometheus-stack, ingress-nginx/ingress-nginx). (4) Values: Configuration parameters overriding chart defaults (hierarchy: chart values.yaml < environment values.yaml < --set CLI flags). Helm architecture (Helm 3, 2019+): Client-only (no Tiller server, direct kubectl access), stores release metadata as Secrets in target namespace (helm.sh/release.v1 secrets), three-way merge for upgrades (last applied, current live, desired state). Removed Helm 2 security issues (no privileged Tiller pod with cluster-admin RBAC). Production best practices (2025): (1) Semantic Versioning (SemVer): Chart version follows MAJOR.MINOR.PATCH (example: 2.5.3). MAJOR: breaking changes (API removal, value restructure). MINOR: backward-compatible features (new templates, optional values). PATCH: bug fixes, no config changes. App version separate in Chart.yaml (tracks application version: nginx:1.25.3, chart version: 3.2.1). Breaking change example: bump chart 2.x → 3.0.0 when changing service.port to service.http.port (requires user config update). (2) Immutable releases: Never modify deployed release directly (kubectl edit bad practice). Use helm upgrade with new values or chart version. Enables rollback (helm rollback myapp 3 returns to revision 3), maintains audit trail (helm history myapp shows all revisions with dates/status). (3) Values organization: Structure values.yaml hierarchically - global values (replicaCount, image, service), component-specific (postgresql.persistence.size, redis.cluster.enabled). Use _helpers.tpl for computed values (example: fullname template combines release name + chart name, labels template generates standard labels). Avoid hardcoding in templates - extract to values (bad: image: nginx:1.25, good: image: {{ .Values.image.repository }}:{{ .Values.image.tag }}). (4) Dependency management: Declare dependencies in Chart.yaml (dependencies: name: postgresql, version: 12.x.x, repository: https://charts.bitnami.com/bitnami, condition: postgresql.enabled). Run helm dependency update to download deps to charts/ directory. Allows disabling dependencies (values: postgresql.enabled: false for external DB). Version constraints: 1.2.3 (exact), ~1.2.3 (patch updates: >=1.2.3 <1.3.0), ^1.2.3 (minor updates: >=1.2.3 <2.0.0), * (any version, avoid in production). (5) Template functions best practices: Use default function for optional values (example: replicaCount: {{ .Values.replicaCount | default 3 }}). Quote strings to prevent YAML parsing issues (example: name: {{ .Values.name | quote }}). 
Use toYaml for nested structures (example: resources: {{ toYaml .Values.resources | nindent 12 }}). Conditionals for optional resources (example: {{- if .Values.ingress.enabled }} renders Ingress only when enabled). (6) Security hardening: Don't store secrets in values.yaml (use external secret management: sealed-secrets, external-secrets operator with Vault/AWS Secrets Manager). Use lookup function to read existing secrets (example: {{ lookup "v1" "Secret" .Release.Namespace "db-password" }}). Set securityContext in templates (runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false). Scan charts with tools (helm-secrets for encrypted values, kubesec for security scoring, trivy for vulnerability scanning). (7) Documentation: README.md with installation instructions, parameters table (autogenerated from values.yaml comments using helm-docs tool). values.yaml inline comments describing each parameter (example: replicaCount: 3 # Number of pod replicas). Chart.yaml with maintainers, home URL, sources, keywords. NOTES.txt template displaying post-install instructions (access URLs, credentials location, next steps). (8) Testing: Use helm lint to validate chart syntax and best practices (detects missing required fields, invalid values, template errors). helm template myapp . --values prod-values.yaml renders templates locally without cluster (verify output before install). Chart tests via templates/tests/ directory (Job pods running validation: curl http://myapp/healthz, wait for readiness). Run with helm test myapp after install. (9) Lifecycle hooks: pre-install, post-install, pre-upgrade, post-upgrade, pre-delete, post-delete hooks for custom logic (example: pre-install Job runs database schema migration, post-delete Job sends cleanup notification). Configured with annotations: helm.sh/hook: pre-upgrade, helm.sh/hook-weight: 10 (execution order), helm.sh/hook-delete-policy: hook-succeeded (cleanup after success). (10) Upgrade strategies: Use helm upgrade --install (installs if missing, upgrades if exists, idempotent). --atomic flag for automatic rollback on failure. --wait waits for resources to be ready before marking release successful (default timeout: 5 minutes, configurable with --timeout 10m). --dry-run --debug shows rendered templates without applying (preview changes before production upgrade). Common patterns (2025): Multi-environment deployment: Base chart with environment-specific values (values-dev.yaml, values-staging.yaml, values-prod.yaml). Deploy with helm upgrade myapp ./mychart --values values-prod.yaml --namespace prod. Monorepo charts: Parent chart with subcharts (charts/api, charts/worker, charts/frontend), shared global values, coordinated versioning. OCI registry storage: Helm 3.8+ supports OCI registries (Docker Hub, GitHub Container Registry, AWS ECR, Google Artifact Registry). Commands: helm package mychart, helm push mychart-1.0.0.tgz oci://ghcr.io/myorg, helm install myapp oci://ghcr.io/myorg/mychart --version 1.0.0. Advantages: unified artifact storage, vulnerability scanning, access control via registry auth. Helm vs alternatives (2025): Helm (templating + package management), Kustomize (overlay-based, built into kubectl, no templating), Jsonnet (JSON templating), ytt (YAML templating), CDK8s (code-based with TypeScript/Python). Helm dominates for public chart distribution (90%+ of ecosystem charts use Helm). Performance at scale: Helm 3 handles 100+ resource templates in single chart. 
For 1000+ microservices, consider ArgoCD/Flux GitOps (manages Helm releases declaratively, automated upgrades). Troubleshooting: helm get values myapp shows currently applied values. helm get manifest myapp shows rendered YAML of last deployment. helm history myapp shows revision history with timestamps. helm rollback myapp 5 reverts to revision 5. Migration from Helm 2: Use helm-2to3 plugin (migrates Tiller release data to Helm 3 Secrets, removes Tiller deployment). Key limitations: No drift detection (manual kubectl changes not tracked by Helm, use diff plugin or GitOps tools). No dependency ordering within single chart (use init containers or Helm hooks for sequencing). Template debugging difficult (use --debug --dry-run, validate output step-by-step).
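A minimal sketch of the values-plus-template pattern described above (chart and value names are hypothetical):

```yaml
# values.yaml - chart defaults, overridden per environment or with --set
replicaCount: 3
image:
  repository: nginx
  tag: "1.25"
resources:
  requests:
    cpu: 100m
    memory: 128Mi
```

The template references those values and release metadata:

```yaml
# templates/deployment.yaml (excerpt) - preview with: helm template myrelease .
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount | default 3 }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
      - name: app
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        resources:
          {{- toYaml .Values.resources | nindent 10 }}
```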

99% confidence
A

Essential kubectl pod management commands (2025 production reference): (1) kubectl get pods: List pods in current namespace with status, ready count, restarts, age. Flags: -n (specific namespace), --all-namespaces or -A (all namespaces), -o wide (additional columns: node, pod IP, nominated node), -o yaml/json (full resource definition), --show-labels (display labels), -l app=nginx (filter by label selector), --field-selector status.phase=Running (filter by field), --watch or -w (stream updates in real-time). Output columns: NAME, READY (2/2 = 2 containers ready out of 2 total), STATUS (Running/Pending/Failed/CrashLoopBackOff/ImagePullBackOff/Terminating), RESTARTS (container restart count), AGE (time since pod created). Example: kubectl get pods -n production -l tier=frontend --field-selector spec.nodeName=node-1. (2) kubectl describe pod <pod-name>: Detailed pod information including events, conditions, containers, volumes, QoS class. Shows: pod metadata (labels, annotations, namespace), node placement, IP addresses, controlled by (Deployment/StatefulSet), container specs (image, ports, env vars, mounts, liveness/readiness probes), resource requests/limits, conditions (PodScheduled, Initialized, ContainersReady, Ready), volumes (ConfigMap/Secret mounts, PVC claims), events (scheduling, image pull, container start/restart/crash, probe failures). Critical for troubleshooting: Last State shows why container restarted (Exit Code 137 = OOMKilled, Exit Code 1 = application error), Events section shows pod lifecycle (FailedScheduling = insufficient resources, ImagePullBackOff = registry auth failure). Example: kubectl describe pod myapp-7d8f9c-xk2lp shows last 1 hour events with timestamps. (3) kubectl logs <pod-name>: Stream pod container logs (stdout/stderr). Flags: -f or --follow (stream logs continuously, like tail -f), --tail=100 (show last 100 lines, default all), --since=1h (logs from last 1 hour), --since-time=2025-01-15T10:00:00Z (absolute timestamp), --timestamps (add RFC3339 timestamp prefix), --previous or -p (logs from previous crashed container, critical for debugging CrashLoopBackOff), -c <container> (specific container in multi-container pod, required if pod has >1 container), --all-containers (all containers in pod, Kubernetes 1.27+). Example: kubectl logs myapp-7d8f9c-xk2lp -c app --tail=500 -f streams last 500 lines and follows. For CrashLoopBackOff debugging: kubectl logs --previous shows why container exited before restart. (4) kubectl exec -it <pod-name> -- /bin/bash: Execute interactive shell inside container for debugging. Flags: -it (interactive terminal with TTY), -c <container> (specific container in multi-container pod), -- (separates kubectl flags from command). Common commands: /bin/bash (bash shell), /bin/sh (sh shell for Alpine-based images), env (list environment variables), ps aux (running processes), netstat -tlnp (listening ports), curl localhost:8080/health (test endpoints), cat /etc/resolv.conf (DNS config). Example: kubectl exec -it myapp-7d8f9c-xk2lp -c sidecar -- curl localhost:15000/stats (query Envoy proxy stats). Non-interactive commands: kubectl exec myapp-7d8f9c-xk2lp -- ls /data (list files without shell). Security note: exec requires pods/exec permission in RBAC, audit all exec commands in production. (5) kubectl delete pod <pod-name>: Delete pod with graceful termination (default 30s grace period, sends SIGTERM then SIGKILL).
Flags: --force (skip graceful deletion, immediate SIGKILL, use only for hung pods), --grace-period=60 (custom grace period in seconds), --now (alias for --grace-period=1), -l app=nginx (delete all pods matching label selector, dangerous in production). Behavior: Deployment/ReplicaSet-managed pods recreate immediately (controller maintains desired replica count). For permanent deletion, delete owning controller (kubectl delete deployment myapp). StatefulSet pods recreate with same identity (pod-0 returns as pod-0). Example: kubectl delete pod myapp-7d8f9c-xk2lp --grace-period=120 waits 2 minutes for graceful shutdown. (6) kubectl apply -f <file>: Create or update resources from YAML/JSON file (declarative, idempotent). Flags: -f <file/dir/url> (single file, directory of manifests, or HTTP URL), --dry-run=client (validate locally without applying), --dry-run=server (validate server-side without persisting), --validate=true (schema validation, default true), --prune (delete resources not in current config, use with -l selector). Records last-applied-configuration annotation for three-way merge. Example: kubectl apply -f https://k8s.io/examples/application/deployment.yaml applies remote manifest. For directories: kubectl apply -f ./manifests/ applies all YAML/JSON files recursively. GitOps pattern: kubectl apply -f <(kustomize build overlays/production) pipes Kustomize output to kubectl. (7) kubectl port-forward <pod-name> <local-port>:<pod-port>: Forward local port to pod port for debugging (bypasses Service, direct pod access). Syntax: kubectl port-forward <pod-name> 8080:80 (local 8080 → pod 80), kubectl port-forward <pod-name> :80 (random local port), kubectl port-forward <pod-name> 8080:80 9090:9090 (multiple ports). Also works with Services: kubectl port-forward svc/myapp 8080:80 (forwards to random pod selected by service). Binding: defaults to localhost only (access via http://localhost:8080), use --address 0.0.0.0 for external access (security risk). Example: kubectl port-forward myapp-7d8f9c-xk2lp 3000:3000 -n production, then curl http://localhost:3000/api/health tests pod endpoint directly. Use case: debug backend services without exposing via LoadBalancer/Ingress. (8) kubectl top pod: Resource usage (CPU/memory) for pods, requires metrics-server installed (kubectl get deployment metrics-server -n kube-system). Flags: --containers (show per-container usage in multi-container pods), --all-namespaces or -A, -l app=nginx (filter by labels), --sort-by=cpu or --sort-by=memory (sort output). Output: NAME, CPU (millicores, 250m = 0.25 core), MEMORY (bytes, 128Mi); CPU%/MEM% columns are shown by kubectl top node (percentage of node capacity), not kubectl top pod. Example: kubectl top pod --all-namespaces --sort-by=memory shows memory hogs. Note: shows actual usage, not requests/limits (compare with kubectl describe for capacity vs usage). Additional critical commands: (9) kubectl get events: Cluster events for troubleshooting (pod scheduling failures, image pulls, volume mounts, probe failures). Flags: --sort-by=.metadata.creationTimestamp (chronological order), --field-selector involvedObject.name=<pod-name> (filter events for specific pod). Events expire after 1 hour (default), shows LAST SEEN, TYPE (Normal/Warning), REASON (FailedScheduling, BackOff, Unhealthy), MESSAGE. (10) kubectl debug <pod-name>: Ephemeral debug container in running pod (ephemeral containers stable since Kubernetes 1.25). Syntax: kubectl debug <pod-name> -it --image=busybox:1.36 --target=<container> (shares process namespace with target container, see all processes). Use case: debug distroless containers without shell (no /bin/bash in prod images).
Example: kubectl debug myapp-7d8f9c-xk2lp -it --image=nicolaka/netshoot --target=app (attach debugging tools to production pod). (11) kubectl rollout restart deployment/<name>: Rolling restart of deployment pods (recreates pods one-by-one respecting PodDisruptionBudget). Use case: apply ConfigMap/Secret changes (requires pod restart), clear cache, force image pull. Example: kubectl rollout restart deployment/myapp -n production. Pro tips (2025): Use kubectl aliases (alias k=kubectl, alias kgp='kubectl get pods', alias kd='kubectl describe', alias kl='kubectl logs'). Shell completion: source <(kubectl completion bash) enables tab completion. Context switching: kubectl config use-context production, kubectl config set-context --current --namespace=myapp (change default namespace). Quick pod access: kubectl run tmp-shell --rm -i --tty --image=nicolaka/netshoot -- /bin/bash (temporary debug pod, auto-deleted on exit). Common troubleshooting workflows: (1) Pod not starting: kubectl get pod (check STATUS), kubectl describe pod <pod-name> (check Events for FailedScheduling/ImagePullBackOff), kubectl logs <pod-name> (check application errors). (2) Pod crashing: kubectl logs <pod-name> --previous (last crash logs), kubectl describe pod <pod-name> (check Exit Code: 137=OOM, 1=error, 143=SIGTERM), kubectl exec -it <pod-name> -- sh (test startup manually). (3) Service unreachable: kubectl port-forward <pod-name> 8080:8080 (test pod directly), kubectl get endpoints <service-name> (verify pods selected), kubectl logs <pod-name> (check application listening on correct port).

99% confidence
A

Four access modes define how volumes can be mounted: (1) ReadWriteOnce (RWO) - Volume mounted read-write by a single node. Multiple pods on the SAME node can mount it simultaneously (node-level restriction, not pod-level). Most common mode. Use case: Single-instance databases (PostgreSQL, MySQL), StatefulSet workloads. Example: AWS EBS, GCP Persistent Disk, Azure Disk. Behavior: Pod on node-A mounts successfully, another pod on node-A also mounts (same node), but pod on node-B fails with Multi-Attach error. (2) ReadOnlyMany (ROX) - Volume mounted read-only by multiple nodes simultaneously. All pods can read, none can write. Use case: Shared configuration files, ML model serving, static assets. Example: NFS, CephFS in read-only mode. (3) ReadWriteMany (RWX) - Volume mounted read-write by multiple nodes simultaneously. All pods can read AND write concurrently. Use case: Shared uploads directory (WordPress), distributed logging, collaborative storage. Example: NFS, AWS EFS, Azure File. NOT supported by block storage (EBS, Azure Disk). (4) ReadWriteOncePod (RWOP, stable in 1.29) - Volume mounted read-write by a single pod across entire cluster (strictest isolation). Stronger than RWO - ensures one pod only, not just one node. Use case: Databases requiring absolute exclusivity (prevents split-brain), license-restricted apps. Supported by AWS EBS CSI 1.13+, GCP PD CSI 1.8+, Azure Disk CSI 1.23+. Specified in PVC with accessModes: [ReadWriteOnce] or [ReadWriteMany] etc.

99% confidence
A

ReadWriteMany (RWX) requires network file systems - block storage does NOT support RWX. Supported: NFS (most common, self-hosted or NAS appliances), AWS EFS (via EFS CSI driver), Azure File (SMB-based), GCP Filestore (via Filestore CSI driver), CephFS (POSIX-compliant distributed filesystem), GlusterFS, Portworx (distributed block storage with file mode). NOT supported: AWS EBS, GCP Persistent Disk, Azure Disk, Ceph RBD, iSCSI, local volumes - all block storage types. Reason: Block devices cannot multi-attach for writes at kernel level (physical limitation). Attempting RWX PVC with storage class backed by EBS (example: storageClassName: gp3) fails with ProvisioningFailed: volume plugin does not support ReadWriteMany. Solution: Use NFS/EFS storage class or redesign app for pod-local storage with external object storage (S3/GCS). Performance trade-off: Network file systems (NFS) have higher latency (2-5ms) versus block storage (EBS 0.5-1ms). Use RWX only when multiple pods truly need concurrent write access.
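A hedged sketch of the NFS pattern (server address, export path, and sizes are hypothetical): a statically provisioned RWX PersistentVolume and a claim that binds to it.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-uploads
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteMany"]     # RWX needs a network filesystem, not block storage
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.50                # hypothetical NFS server
    path: /exports/uploads           # hypothetical export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""               # bind to the statically provisioned PV above
  resources:
    requests:
      storage: 100Gi
```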

99% confidence
A

RWO allows multiple pods on the SAME node to mount the volume, RWOP allows only ONE pod in the entire cluster. RWO (ReadWriteOnce): Node-level restriction. Volume attaches to single node, but multiple pods scheduled to that node can all mount it. Common confusion: RWO does NOT mean single pod. Example: Deployment with 3 replicas, all pods scheduled to node-A, all 3 pods successfully mount RWO volume simultaneously. StatefulSet with RWO: Each pod gets separate PVC (pod-0 → pvc-0, pod-1 → pvc-1), avoiding conflicts. RWOP (ReadWriteOncePod): Pod-level restriction (strictest). Only one pod can mount the volume across entire cluster, even if on same node. Kubernetes 1.29+ stable feature. Example: PostgreSQL Deployment with RWOP PVC - first pod mounts successfully, second pod on ANY node (including same node) fails with volume already in use. Use case: Prevent split-brain scenarios for databases (during pod rescheduling, ensures old pod unmounts before new pod mounts). Requirement: CSI driver must support SINGLE_NODE_SINGLE_WRITER capability (AWS EBS CSI 1.13+, GCP PD CSI 1.8+, Azure Disk CSI 1.23+). Migration: Changing RWO to RWOP requires recreating PVC (not in-place update, data backup needed).
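A short RWOP sketch (claim name and StorageClass are hypothetical; assumes a CSI driver advertising SINGLE_NODE_SINGLE_WRITER, per the driver versions listed above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: ["ReadWriteOncePod"]  # exactly one Pod cluster-wide may mount this volume
  storageClassName: ebs-gp3          # hypothetical CSI-backed StorageClass
  resources:
    requests:
      storage: 20Gi
```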

99% confidence
A

Headless Service: Service with spec.clusterIP: None (no cluster IP allocation, no kube-proxy load balancing). DNS behavior: returns A/AAAA records pointing directly to pod IPs instead of single service IP. Required for StatefulSets to provide stable network identity: each pod gets DNS record <pod-name>.<service-name>.<namespace>.svc.cluster.local (example: web-0.nginx.default.svc.cluster.local). Use cases: (1) StatefulSet network identity - required for databases (PostgreSQL, MongoDB, Cassandra) needing stable hostnames. (2) Direct pod-to-pod communication - service mesh, Kafka brokers, Elasticsearch nodes. (3) Client-side load balancing - gRPC clients connecting to all pods directly. (4) Custom service discovery - applications implementing own load balancing logic. Configuration: spec.clusterIP: None + selector labels. DNS returns multiple A records (one per pod). Pods must be running and ready (readiness probe passes) to appear in DNS. Example: StatefulSet with 3 replicas creates web-0, web-1, web-2 with individual DNS entries. Essential for StatefulSet stable network identity and distributed systems requiring direct peer addressing.
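Minimal headless Service sketch (names illustrative): clusterIP: None is the only difference from a normal Service, and it is what a StatefulSet's serviceName points at.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None        # headless: DNS returns the individual Pod IPs
  selector:
    app: nginx
  ports:
  - port: 80
```

With a StatefulSet named web using serviceName: nginx, Pods resolve as web-0.nginx.<namespace>.svc.cluster.local and so on.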

99% confidence
A

StorageClass: API object describing storage 'classes' with different performance/backup policies (SSD vs HDD, replicated vs non-replicated, fast vs slow). Enables dynamic provisioning: automatically creates PersistentVolume when PersistentVolumeClaim requests StorageClass, eliminating manual PV pre-provisioning. Key fields: (1) provisioner - creates PV (kubernetes.io/aws-ebs, kubernetes.io/gce-pd, kubernetes.io/azure-disk, csi.drivers like ebs.csi.aws.com for CSI). (2) parameters - provisioner-specific config (AWS: type=gp3 iopsPerGB=50, GCP: type=pd-ssd replication-type=regional-pd, Azure: storageaccounttype=Premium_LRS). (3) reclaimPolicy - Delete (default, deletes PV when PVC deleted) or Retain (keeps PV for manual cleanup). (4) volumeBindingMode - Immediate (bind PV immediately) or WaitForFirstConsumer (delay binding until pod scheduled, ensures PV in same zone as pod). (5) allowVolumeExpansion - enables resizing PVCs. Default StorageClass (annotation: storageclass.kubernetes.io/is-default-class: "true") used when PVC doesn't specify storageClassName. Example PVC: storageClassName: fast-ssd, resources.requests.storage: 10Gi. Production benefits: automated storage provisioning, policy-driven storage tiers, simplified developer experience (request storage without knowing infrastructure details).
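Illustrative StorageClass (provisioner and parameters depend on your cluster; this sketch assumes the AWS EBS CSI driver):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # make this the default class
provisioner: ebs.csi.aws.com              # CSI driver; swap for your provider
parameters:
  type: gp3                               # provisioner-specific parameter
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # provision in the zone of the consuming Pod
allowVolumeExpansion: true
```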

99% confidence
A

NodePort Service allocates port from range 30000-32767 by default (configurable via kube-apiserver --service-node-port-range flag). High port range rationale: (1) Avoids well-known ports (0-1023, requires root, reserved for system services like SSH 22, HTTP 80, HTTPS 443). (2) Avoids registered ports (1024-49151, used by common applications like MySQL 3306, PostgreSQL 5432, Redis 6379). (3) Prevents conflicts with services already running on nodes (monitoring agents, sidecar proxies, host daemons). (4) Ephemeral port range compatibility (Linux typically 32768-60999, Windows 49152-65535 - NodePort range avoids overlap). NodePort behavior: same port allocated on ALL nodes in cluster (example: NodePort 30080 accessible via any-node-ip:30080). Port allocation: automatic (Kubernetes assigns random available port) or manual (specify spec.ports[].nodePort: 30080 within allowed range). Production usage: typically behind cloud LoadBalancer (AWS ELB, GCP GLB) or Ingress Controller, not exposed directly to internet. Development/on-prem: direct NodePort access acceptable with firewall rules. Security consideration: NodePort exposes service on all nodes - use NetworkPolicy to restrict pod-level access.
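Illustrative NodePort Service (names and ports are hypothetical); omit nodePort to let Kubernetes pick a free port from the range.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80            # ClusterIP port inside the cluster
    targetPort: 8080    # container port
    nodePort: 30080     # must fall within 30000-32767 (default range)
```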

99% confidence
A

Namespaces: virtual clusters within physical cluster providing scope for resource names and resource isolation. Use cases: (1) Multi-tenancy - isolate teams/customers (team-a, team-b, customer-123). (2) Environment separation - dev, staging, prod in same cluster. (3) Resource organization - frontend, backend, database namespaces. (4) Resource quota enforcement - limit CPU/memory per namespace. Namespaced resources: pods, services, deployments, configmaps, secrets, replicasets, jobs, cronjobs, ingresses, networkpolicies, serviceaccounts, roles, rolebindings, persistentvolumeclaims. Cluster-scoped resources (NOT namespaced): nodes, persistentvolumes, storageclasses, clusterroles, clusterrolebindings, namespaces themselves. Initial namespaces: (1) default - default for objects without namespace. (2) kube-system - Kubernetes system components (kube-dns, metrics-server, kube-proxy). (3) kube-public - readable by all, convention for publicly accessible config. (4) kube-node-lease - node heartbeat objects for performance. DNS format: <service>.<namespace>.svc.cluster.local (example: api.backend.svc.cluster.local). Best practices: (1) Avoid 'default' namespace for production apps. (2) Use ResourceQuotas per namespace (limits.cpu: 10, limits.memory: 20Gi, pods: 50). (3) RBAC for namespace access control (RoleBinding grants namespace-scoped permissions). (4) LimitRanges for default resource requests/limits. (5) NetworkPolicy for namespace-level network isolation. (6) Naming convention: <team>-<environment> (platform-prod, data-dev). Production impact: isolation prevents resource conflicts, quotas prevent resource exhaustion, RBAC limits blast radius of compromised credentials.
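A sketch of the quota best practice above (namespace name and limits are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: platform-prod
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: platform-prod
spec:
  hard:
    limits.cpu: "10"     # total CPU limits across the namespace
    limits.memory: 20Gi
    pods: "50"           # max Pod count
```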

99% confidence
A

kubectl create vs kubectl apply (2025 production guide): Two approaches to resource management with fundamentally different philosophies and use cases. (1) kubectl create (Imperative command): Creates new resource from file/CLI flags, fails if resource already exists (AlreadyExists error). One-time operation, no update capability. Syntax: kubectl create -f deployment.yaml, kubectl create deployment nginx --image=nginx:1.25, kubectl create secret generic db-creds --from-literal=password=secret123. Behavior: Reads file/flags → creates resource → exits. No tracking of previous state. Subsequent kubectl create -f deployment.yaml fails with Error from server (AlreadyExists): deployments.apps nginx already exists. Use cases for create: (1) Quick testing/prototyping - fast resource creation without declarative files (kubectl create deployment test --image=busybox --replicas=3). (2) One-off resources - Jobs, debug pods, temporary ConfigMaps (kubectl create job test-job --image=perl -- perl -Mbignum=bpi -wle 'print bpi(2000)'). (3) Imperative workflows - scripts where idempotency not required, explicit creation vs update logic. (4) Resource generation - create deployment YAML without applying (kubectl create deployment nginx --image=nginx:1.25 --dry-run=client -o yaml > deployment.yaml, then edit before applying). Limitations: No updates (must use kubectl replace or delete + recreate), no change tracking (can't see what changed between versions), no partial field updates (must specify all fields). (2) kubectl apply (Declarative command): Creates or updates resource from file (idempotent, can run multiple times safely). Tracks configuration changes via kubectl.kubernetes.io/last-applied-configuration annotation (stores previous config for three-way merge). Syntax: kubectl apply -f deployment.yaml, kubectl apply -f ./manifests/ (directory), kubectl apply -k overlays/production (Kustomize). Behavior: Reads file → calculates diff (last-applied-configuration vs live state vs desired state) → patches resource → updates annotation. If resource doesn't exist, creates it. If exists, patches with changes only (preserves fields not in manifest). Three-way merge: (1) Last applied configuration (stored in annotation), (2) Current live state (etcd), (3) New desired state (manifest file). Algorithm: If field in desired but not last-applied, add field (new configuration). If field in last-applied but not desired, delete field (explicitly removed). If field in desired with different value, update field (modified configuration). If field in live but not desired/last-applied, preserve field (added by controller/operator, don't touch). Example: Original manifest has replicas: 3. Apply it. Later, HPA controller sets replicas: 5 in live state (autoscaling). New manifest changes image: nginx:1.26 but omits replicas field. kubectl apply preserves replicas: 5 (not in last-applied, assumed controller-managed), updates image only. Use cases for apply: (1) GitOps workflows - manifest files in Git, CI/CD applies on merge (ArgoCD, Flux CD pattern). (2) Production deployments - declarative config versioned in SCM, auditable changes via Git history. (3) Incremental updates - change 1 field in manifest, apply updates that field only (no need to specify all fields). (4) Multi-environment configs - base manifests + overlays (Kustomize, Helm), apply generates final config. (5) Disaster recovery - entire cluster state defined in manifests, kubectl apply -f ./cluster-state/ recreates all resources. 
Key differences: Idempotency: create fails on duplicate (not idempotent), apply succeeds always (idempotent). Updates: create cannot update (must use kubectl replace --force which deletes + recreates, causes downtime), apply seamlessly updates (rolling update for Deployments, in-place patch for ConfigMaps). State tracking: create doesn't track (no annotation), apply maintains last-applied-configuration annotation (enables diff, rollback). Partial updates: create requires full resource spec (omitted fields use API defaults, may cause unintended changes), apply only touches specified fields (preserves unspecified fields from live state). Server-side apply (Kubernetes 1.22+ stable): Enhanced apply using --server-side flag. Moves merge logic from client to API server (better conflict resolution, field ownership tracking). Syntax: kubectl apply -f deployment.yaml --server-side. Benefits: (1) Field managers - tracks which controller/user owns each field (kubectl manages spec.image, HPA manages spec.replicas, no conflicts). (2) Faster - server-side diff, no need to download full resource to client. (3) Large resources - handles CRDs with 10,000+ fields without client-side memory issues. (4) Conflict detection - errors if field owned by different manager (prevents accidental overwrites). Example: kubectl apply --server-side --field-manager=my-ci-pipeline -f deployment.yaml. kubectl replace (Alternative to create for updates): Updates entire resource (not a merge, full replacement). Syntax: kubectl replace -f deployment.yaml. Requires resourceVersion in manifest (prevents concurrent modification). Deletes + recreates resource if --force flag used (causes downtime, orphans pods momentarily). Best practices (2025): (1) Production: Always use apply: Enables GitOps, tracks changes, idempotent, safe for automation. Store manifests in Git, apply via CI/CD (GitHub Actions, GitLab CI). (2) Development: Use create for ephemeral resources: Quick testing pods (kubectl create -f test-pod.yaml), Jobs (kubectl create job test --image=busybox). Faster than maintaining declarative files for throwaway resources. (3) Server-side apply for large CRDs: Istio VirtualService, Crossplane Compositions, large ConfigMaps (>1MB). Reduces client memory usage, better conflict handling. (4) Dry-run before apply: kubectl apply -f deployment.yaml --dry-run=server shows what would change without applying (safe preview). Compare with --dry-run=client (client-side validation only, no server schema check). (5) Declarative everything: ConfigMaps, Secrets, RBAC, NetworkPolicies - all via apply, not imperative create. Ensures reproducibility (cluster rebuild from manifests directory). (6) Annotations for tracking: Add custom annotations (git.commit: abc123, deployed.by: ci-pipeline, deployed.at: 2025-01-15T10:30:00Z) to manifests for audit trail. Common workflows: Initial deployment: kubectl apply -f deployment.yaml (creates Deployment). Update image: Edit deployment.yaml (change image: nginx:1.25 → nginx:1.26), kubectl apply -f deployment.yaml (triggers rolling update). Rollback: Git revert commit → kubectl apply -f deployment.yaml (reverts to previous config). Delete: kubectl delete -f deployment.yaml (deletes resources defined in file, alternative to kubectl delete deployment nginx). Troubleshooting: Warning: resource X is missing last-applied-configuration annotation: Resource created with kubectl create, first kubectl apply adds annotation. Safe to ignore. 
Error: field X is immutable: Some fields cannot be updated (Deployment selector, Service clusterIP). Solution: Delete + recreate resource (kubectl delete -f file.yaml && kubectl apply -f file.yaml) or edit immutable field to create new resource. Conflict (server-side apply): Error: Apply failed with 1 conflict - field spec.replicas managed by HPA. Solution: Remove conflicting field from manifest (let controller manage it) or force overwrite with --force-conflicts (dangerous, may break autoscaling). Performance: kubectl apply is slower than create for initial creation (calculates diff, stores annotation, ~50ms overhead). Negligible in production (prioritize idempotency over 50ms). For bulk operations (1000+ resources), consider server-side apply (10x faster than client-side for large resources). Integration with tools: Helm: Uses helm install/upgrade (not kubectl apply), but Helm 3 uses three-way merge similar to apply. Kustomize: Built into kubectl (kubectl apply -k <kustomization-dir>), generates manifests then applies. ArgoCD/Flux CD: Uses kubectl apply internally, polls Git repos, auto-applies changes. Terraform Kubernetes provider: Uses terraform apply (maps to kubectl apply pattern, tracks state in .tfstate). Migration path: If existing resources created with kubectl create (no annotation), first kubectl apply adds annotation (Warning shown but succeeds). Subsequent applies work normally (three-way merge enabled).
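A hedged command sketch of the preview-then-apply workflow described above (the file name and field-manager value are illustrative):

kubectl diff -f deployment.yaml                                              # show what would change against live state
kubectl apply -f deployment.yaml --dry-run=server                            # server-side validation, nothing persisted
kubectl apply -f deployment.yaml                                             # client-side apply (three-way merge)
kubectl apply -f deployment.yaml --server-side --field-manager=ci-pipeline   # server-side apply with an explicit field manager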

99% confidence
A

Kubernetes Probes (2025): Kubelet-executed health checks for container lifecycle management and traffic routing. Three probe types with distinct purposes: (1) Liveness Probe: Detects if container is running correctly, restarts container if probe fails (recovers from deadlocks, hung processes, unrecoverable errors). Question answered: Is the container still alive? Failure action: Kubelet kills container, respects pod restartPolicy (Always/OnFailure restarts container, Never leaves container terminated). Use cases: Detect application deadlocks (web server hung, no longer responding to requests), infinite loops (app stuck in bad state), memory leaks causing app freeze (out of memory but not OOMKilled yet), corrupted state (database connection pool exhausted, cannot recover). Example: Java app with memory leak gradually becomes unresponsive → liveness probe fails after 3 consecutive failures (failureThreshold: 3) → kubelet restarts container → fresh start recovers app. Anti-pattern: Don't use liveness probe for startup failures (use startup probe instead) - rapid liveness restarts during slow startup cause CrashLoopBackOff. Best practice: Conservative settings to avoid flapping - initialDelaySeconds: 60 (wait 60s after container start), periodSeconds: 10 (check every 10s), timeoutSeconds: 5 (probe timeout), failureThreshold: 3 (3 consecutive failures before restart). (2) Readiness Probe: Detects if container is ready to serve traffic, removes pod from Service endpoints if probe fails (prevents routing requests to unhealthy pods). Question answered: Can this container handle traffic right now? Failure action: Pod marked NotReady, removed from Service load balancing (kubectl get endpoints shows pod excluded), no container restart. Use cases: Application warming up (loading caches, warming JVM, downloading data), temporary overload (high CPU usage, cannot handle more requests), dependency unavailable (database connection lost, waiting for reconnection), graceful shutdown (draining connections during preStop hook). Example: E-commerce app depends on Redis cache. Redis pod restarts → app cannot serve requests without cache → readiness probe fails (GET /ready returns 503) → pod removed from Service → requests route to healthy pods only → Redis comes back → readiness probe succeeds → pod added back to Service. Critical difference from liveness: Readiness is temporary (pod recovers, added back to Service), liveness is permanent failure (requires restart). Best practice: Fail fast on dependencies - readiness probe checks database connectivity, cache availability, downstream API health. Probe frequently (periodSeconds: 5) to quickly detect/recover from transient issues. (3) Startup Probe (Kubernetes 1.20+ stable): Detects if application has finished starting, disables liveness/readiness probes until first success (prevents premature liveness kills during slow startup). Question answered: Has the container finished initializing? Failure action: After failureThreshold * periodSeconds total time, kubelet kills container (startup timeout exceeded). Once succeeds, disabled permanently (liveness/readiness take over). Use cases: Slow-starting applications (Java apps with 60s+ startup time, ML models loading large datasets, database schema migrations before app starts), legacy apps with unpredictable startup duration, microservices with heavy initialization (populating caches, building indexes). Example: Spring Boot app takes 120 seconds to start (loads 5GB dataset into memory). 
Without startup probe: liveness probe starts after initialDelaySeconds: 60, begins failing at 60s (app not ready yet), container restarted after failureThreshold consecutive failures → infinite CrashLoopBackOff. With startup probe: failureThreshold: 30, periodSeconds: 5 (total 150s allowed startup time) → app starts at 120s → startup probe succeeds → liveness/readiness probes enabled → normal operation. Configuration: Higher failureThreshold (30-60) + short periodSeconds (5-10) allows long startup without probing too infrequently. Probe methods (2025): (1) httpGet: HTTP GET request to container IP + port + path. Success: 200 <= status code < 400. Configuration: httpGet.path: /healthz, httpGet.port: 8080, httpGet.scheme: HTTP/HTTPS, httpGet.httpHeaders (custom headers for authentication). Example: GET http://pod-ip:8080/healthz, expects 200 OK with response body healthy. Most common method for web apps (REST APIs, web servers). Security: Use HTTPS scheme for probes to encrypted endpoints (the kubelet skips certificate verification for HTTPS probes, so self-signed certificates are accepted). (2) tcpSocket: TCP connection to container port. Success: connection established (socket opened successfully). Configuration: tcpSocket.port: 3306. Example: Check MySQL readiness by attempting TCP connection to port 3306. Useful for protocols without HTTP endpoints (databases, message queues, caches - Redis, PostgreSQL, RabbitMQ, Memcached). Limitation: Only checks port is listening, not application health (MySQL port open but database corrupted still passes). (3) exec: Executes command inside container. Success: exit code 0. Configuration: exec.command: [cat, /tmp/healthy] (check file exists), exec.command: [pg_isready, -U, postgres] (PostgreSQL readiness check), exec.command: [redis-cli, ping] (Redis health check). Example: Liveness probe runs ls /app/server.pid every 10s, fails if PID file missing (app crashed). Use cases: Legacy apps without HTTP endpoints, custom health logic (check database row count, verify file system state, run application-specific validation). Performance: Slower than httpGet/tcpSocket (spawns a process inside the container, incurs overhead). Avoid complex commands (no database queries taking >1s). (4) grpc (Kubernetes 1.27+ stable): gRPC health check via grpc.health.v1.Health/Check RPC. Configuration: grpc.port: 50051, grpc.service: myapp (optional service name). Example: gRPC server implements health service returning SERVING/NOT_SERVING status. Native support for gRPC microservices (Envoy, Istio service mesh). Advantage: Type-safe, efficient binary protocol, standard health check interface. Probe configuration parameters (2025): (1) initialDelaySeconds (default: 0): Wait before first probe after container starts. Use for apps with known startup time (Java: 30-60s, Python/Node.js: 5-15s, Go: 1-5s). Too low → premature failures during startup. Too high → delayed failure detection. (2) periodSeconds (default: 10): Probe frequency. Liveness: 10-30s (not too frequent, avoid unnecessary restarts). Readiness: 5-10s (detect traffic issues quickly). Startup: 5-10s (balance startup time coverage vs probe overhead). (3) timeoutSeconds (default: 1): Probe timeout. Increase for slow endpoints (database health checks: 3-5s, fast APIs: 1-2s). Timeout triggers failure (counts toward failureThreshold). (4) failureThreshold (default: 3): Consecutive failures before action. Liveness: 3-5 (avoid flapping restarts from transient network issues). Readiness: 2-3 (remove from Service quickly). Startup: 30-60 (long startup tolerance).
(5) successThreshold (default: 1): Consecutive successes required to mark the probe successful after a failure; configurable only for readiness probes, must be 1 for liveness and startup probes. Readiness: 1-2 (add pod back to Service after 1-2 successes). Useful to prevent flapping in/out of Service (set to 2-3 for unstable apps). Production patterns (2025): Web API: Liveness: GET /healthz (checks app is responsive), Readiness: GET /ready (checks dependencies - database, cache, downstream APIs), Startup: GET /healthz with failureThreshold: 30 (allow 150s startup). Database: Liveness: exec [pg_isready] or tcpSocket 5432, Readiness: exec [psql -c SELECT 1] (verify query succeeds), Startup: tcpSocket 5432 with high failureThreshold. Microservice with dependencies: Liveness: lightweight check (GET /ping, no external calls), Readiness: comprehensive check (verify all dependencies reachable), Startup: tcpSocket to app port. Common mistakes (2025): (1) Liveness probe checks dependencies: App depends on database → liveness probe queries database → database down → liveness fails → restarts all app pods → database still down → infinite restart loop. Fix: Liveness checks only app health (is process alive?), readiness checks dependencies (are dependencies reachable?). (2) No startup probe for slow apps: 60s startup app with liveness initialDelaySeconds: 30 → killed shortly after 30s, before startup completes → CrashLoopBackOff. Fix: Add startup probe with sufficient failureThreshold. (3) Readiness == liveness: Using same endpoint for both → temporary overload triggers liveness restart (should only affect readiness). Fix: Separate endpoints - liveness is lightweight, readiness is comprehensive. (4) Probe timeout too short: Health check queries database taking 2s → timeoutSeconds: 1 → fails even when app healthy. Fix: Set timeout > expected response time + network latency (3-5s for database checks). (5) Too aggressive thresholds: failureThreshold: 1 with periodSeconds: 5 → single network blip causes restart. Fix: failureThreshold: 3 allows transient failures. Monitoring probes: kubectl describe pod shows probe failures in Events (Liveness probe failed: HTTP probe failed with statuscode: 500, Readiness probe failed: Get http://10.0.1.5:8080/ready: dial tcp 10.0.1.5:8080: connection refused). Metrics: kube_pod_container_status_restarts_total (Prometheus metric tracking container restarts, including liveness-triggered ones). Impact on Service: Only readiness affects Service endpoints (kubectl get endpoints myapp shows pod IPs, only Ready pods included). Liveness/startup failures don't remove from Service (pod remains in endpoints until marked NotReady or deleted).
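A hedged container spec combining the three probes (port 8080 and the /healthz and /ready paths are assumptions about the application):

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: app
    image: registry.example.com/web:1.0    # illustrative image
    ports:
    - containerPort: 8080
    startupProbe:                          # gates liveness/readiness until first success
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30                 # allows up to ~150s of startup time
    livenessProbe:                         # lightweight: is the process alive?
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:                        # comprehensive: can this pod serve traffic right now?
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2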

99% confidence
A

Kubernetes Job (2025): Controller ensuring one or more pods run to successful completion (exit code 0), ideal for finite tasks vs long-running services (Deployments). Key characteristics: (1) Completion tracking: Tracks successful pod completions (spec.completions: 5 means 5 pods must succeed), tolerates failures up to spec.backoffLimit (default 6 retries before marking Job failed). (2) Pod lifecycle: Creates pods, monitors completion, retries failures (exponential backoff: 10s, 20s, 40s, 80s, 160s, 320s max between retries), deletes pods after completion if ttlSecondsAfterFinished set (automatic cleanup). (3) Non-restarting: Completed pods not restarted (unlike Deployment which maintains desired replica count indefinitely). Job patterns: (1) Single completion (spec.completions: 1, spec.parallelism: 1): One pod runs to completion. Use: One-time database migration, single backup task, report generation. Example: Database schema migration job - runs pg_migrate script once, completes, job done. (2) Fixed completion count (spec.completions: 10, spec.parallelism: 3): 10 pods must succeed, max 3 running concurrently. Use: Batch processing (process 10 files, 3 workers at a time), distributed ETL (10 data chunks processed in parallel). Example: Image processing job processes 1000 images - completions: 1000, parallelism: 50 (50 workers process 20 images each). (3) Work queue (spec.completions omitted, spec.parallelism: 5): Pods pull tasks from external queue (Redis, RabbitMQ), job completes when queue empty and all pods finish. Use: Dynamic workload (queue length unknown upfront), distributed task processing. Example: Video transcoding - workers poll SQS queue, transcode videos, job finishes when queue drained. Job configuration: spec.template (pod template identical to Deployment podTemplate), spec.backoffLimit: 6 (max retries before job fails, prevent infinite retry loops), spec.activeDeadlineSeconds: 3600 (kill job after 1 hour regardless of completion, prevents hung jobs), spec.ttlSecondsAfterFinished: 100 (delete job + pods 100s after completion, automatic garbage collection). Failure handling: Pod fails with exit code 1 → Job creates new pod (retry count++), backoffLimit exceeded → Job marked Failed, remaining pods terminated. Node failure: Pods on failed node rescheduled to healthy nodes (Job continues). Indexing (Kubernetes 1.24+ stable): Indexed Job with spec.completionMode: Indexed assigns each pod unique index (0 to completions-1) via JOB_COMPLETION_INDEX env var. Use case: Process array of inputs where each pod handles specific index (pod-0 processes file-0.csv, pod-1 processes file-1.csv). Example: Training 100 ML models in parallel - completionMode: Indexed, completions: 100, parallelism: 20, each pod trains model for its assigned dataset index. Kubernetes CronJob (2025): Creates Jobs on recurring schedule (Cron syntax), manages Job lifecycle (creates, monitors, cleans up), ideal for periodic tasks. CronJob configuration: (1) spec.schedule: Cron syntax (* * * * * = minute hour day month weekday). Examples: 0 2 * * * (2am daily), */15 * * * * (every 15 minutes), 0 0 * * 0 (midnight every Sunday), 0 */6 * * * (every 6 hours). Supports non-standard entries: @hourly, @daily, @weekly, @monthly, @yearly. Timezone: spec.timeZone: America/New_York (Kubernetes 1.27+, previously assumed UTC only). (2) spec.jobTemplate: Job spec template used for each scheduled run (same fields as Job: completions, parallelism, backoffLimit). 
(3) spec.concurrencyPolicy: Behavior when new job scheduled while previous still running. Allow (default): Run concurrent jobs (multiple instances), Forbid: Skip new run if previous still active (prevent overlap), Replace: Cancel previous job, start new one (only one job at a time). Example: Database backup every hour with concurrencyPolicy: Forbid - if backup takes 90 minutes, 1-hour trigger skipped (prevents concurrent backups corrupting data). (4) spec.successfulJobsHistoryLimit: Keep last N successful jobs (default 3). Failed jobs: spec.failedJobsHistoryLimit (default 1). Automatic cleanup prevents Job resource accumulation. (5) spec.startingDeadlineSeconds: If job cannot be scheduled within deadline (controller down, quota exceeded), mark as missed. Example: startingDeadlineSeconds: 600 - if CronJob misses 10-minute window, skip run rather than queue backlog. (6) spec.suspend: Pause CronJob without deleting (suspend: true stops new job creation, existing jobs continue). Use: Temporarily disable scheduled backups during maintenance. CronJob vs Job differences: Job: One-time finite task, manually created (kubectl apply -f job.yaml), runs until completion or failure, no scheduling. CronJob: Recurring scheduled task, automatically creates Jobs per schedule, manages Job history, supports concurrency policies. Relationship: CronJob creates Job, Job creates Pods (CronJob → Job → Pod hierarchy). Example: CronJob backup-daily creates Job backup-daily-1705392000 at scheduled time, Job creates Pod backup-daily-1705392000-abc123 to perform backup. Production use cases: Job: One-time database migration (migrate-v2-to-v3 job), batch data import (import-customer-data job), report generation (generate-monthly-report job), ETL pipeline step (transform-data-batch job), ML model training (train-model-version-5 job). CronJob: Scheduled backups (database dump every 6 hours), log rotation (compress old logs daily), cache warming (preload caches every morning before traffic spike), data cleanup (delete old records weekly), health checks (validate data integrity nightly), certificate renewal (check cert expiry monthly), report distribution (email weekly reports). Best practices (2025): (1) Idempotent jobs: Design tasks to be safely re-runnable (job retries shouldn't corrupt data). Use unique identifiers, check-before-execute patterns. (2) Resource limits: Set requests/limits to prevent resource exhaustion (job pods scheduled like any pod, need resources). Example: Memory-intensive ETL job with limits.memory: 4Gi prevents OOMKilled failures. (3) Completion deadlines: Set activeDeadlineSeconds to kill hung jobs (prevent stuck jobs consuming resources indefinitely). Database backup with 1-hour deadline prevents 10-hour hung job. (4) Monitoring: Track job failures (kubectl get jobs shows COMPLETIONS 0/1, SUCCEEDED/FAILED), alert on consecutive CronJob failures (3+ missed backups = alert). Prometheus metrics: kube_job_status_failed, kube_cronjob_status_last_schedule_time. (5) Cleanup: Enable TTL with ttlSecondsAfterFinished: 86400 (delete completed jobs after 24 hours, keep logs for debugging). CronJob history limits prevent accumulation (limit: 3 successful, 1 failed). (6) Concurrency control: Use Forbid for exclusive tasks (database backups, schema migrations), Allow for parallelizable tasks (send email reminders, process event logs). (7) Timezone awareness: Specify timeZone explicitly for DST correctness (America/New_York handles daylight saving transitions, UTC never changes). 
Common issues: (1) Missed schedules: CronJob controller unavailable during scheduled time → job missed. Check with kubectl describe cronjob shows Last Schedule Time, Warning events. Use startingDeadlineSeconds to define tolerance window. (2) Too many concurrent jobs: concurrencyPolicy: Allow with long-running jobs → resource exhaustion (20 concurrent backup jobs). Fix: Use Forbid or Replace. (3) Job doesn't complete: Pod exits code 0 but Job stuck active. Check: completions mismatch (completions: 5 but only 1 pod created), verify pod logs show success. (4) Exponential backoff delays: Job retries with increasing delays (6th retry waits 320s before starting). Monitor backoffLimit breached events. Parallelism tuning: High parallelism (parallelism: 100) for I/O-bound tasks (API calls, file downloads), low parallelism (parallelism: 5) for CPU/memory-intensive tasks (video encoding, model training), prevents node resource exhaustion. CronJob schedule examples: Daily at 2:30am: 30 2 * * *, Every Monday at 9am: 0 9 * * 1, First day of month at midnight: 0 0 1 * *, Every 6 hours: 0 */6 * * *, Every weekday at 5pm: 0 17 * * 1-5. Troubleshooting: kubectl get jobs shows COMPLETIONS 3/5 (3 succeeded, need 5 total), DURATION 10m, AGE 15m. kubectl get pods -l job-name=my-job shows individual pod statuses (Completed, Failed, Running). kubectl logs job/my-job shows logs from all job pods (job selector auto-applied). kubectl describe job my-job shows Events (SuccessfulCreate, BackoffLimitExceeded warnings).
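A hedged CronJob sketch tying the fields above together (image, command, and database host are illustrative, not a recommended backup setup):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 */6 * * *"              # every 6 hours
  timeZone: "America/New_York"         # Kubernetes 1.27+
  concurrencyPolicy: Forbid            # skip a run if the previous backup is still active
  startingDeadlineSeconds: 600
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 3
      activeDeadlineSeconds: 3600      # kill hung backups after 1 hour
      ttlSecondsAfterFinished: 86400   # garbage-collect finished Jobs after 24 hours
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: backup
            image: postgres:16
            command: ["sh", "-c", "pg_dump -h db -U postgres appdb > /tmp/appdb.sql"]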

99% confidence
A

Resource Requests and Limits (2025): Per-container resource allocation controls for CPU and memory, critical for cluster capacity planning, pod scheduling, and runtime enforcement. (1) Resource Requests: Guaranteed minimum resources allocated to container, used by scheduler for pod placement decisions, enforced during scheduling (not runtime). CPU Request: Minimum CPU shares container receives (measured in cores: 1000m = 1 CPU core, 500m = 0.5 core, 100m = 0.1 core). Example: resources.requests.cpu: 500m guarantees container gets 0.5 CPU core worth of time slices. Scheduler behavior: Scheduler sums all container requests in pod, finds node with sufficient allocatable CPU (node.status.allocatable.cpu >= sum of pod requests). If no node has capacity, pod stays Pending with event: 0/5 nodes available: Insufficient cpu. Runtime behavior: Request is soft limit (container can use more if available). Linux CFS (Completely Fair Scheduler) uses cpu.shares (request 500m = 512 shares, 1000m = 1024 shares, proportional CPU time allocation when contention occurs). Memory Request: Minimum memory container receives (measured in bytes: 128Mi = 134,217,728 bytes, 1Gi = 1,073,741,824 bytes, 500Mi = 524,288,000 bytes). Example: resources.requests.memory: 256Mi guarantees 256 MiB available for container. Scheduler behavior: Scheduler checks node.status.allocatable.memory >= sum of pod memory requests. Insufficient memory → pod Pending. Runtime behavior: Request is scheduling hint only (kubelet doesn't enforce minimum memory allocation, container can use less than requested). Key purpose: Scheduler placement + resource accounting (prevent overcommitment), not runtime enforcement. (2) Resource Limits: Maximum resources container can consume, enforced at runtime by kubelet (CPU throttling, memory OOMKilled). CPU Limit: Maximum CPU container allowed to use (hard cap). Example: resources.limits.cpu: 2000m caps container at 2 CPU cores. Enforcement: Linux cgroup cpu.cfs_quota_us and cpu.cfs_period_us. Container using > limit gets throttled (container_cpu_cfs_throttled_seconds_total metric increases, processes see high CPU wait time). No pod eviction for CPU limit exceeded (throttling only). Effect: Application performance degradation (slow response times, increased latency), but pod keeps running. Monitoring: High CPU throttling (>10% of time throttled) indicates limit too low or application inefficiency. Memory Limit: Maximum memory container allowed to use before OOMKilled. Example: resources.limits.memory: 512Mi caps container at 512 MiB. Enforcement: Linux cgroup memory.limit_in_bytes. Container exceeding limit triggers OOM killer → kubelet kills container (exit code 137 with reason OOMKilled) → container restart (if restartPolicy allows). Effect: Pod restart, potential service disruption, container startup delay. Critical difference from CPU: Memory limit breach kills container (unrecoverable without restart), CPU limit throttles (degraded performance but keeps running). Resource units: CPU: 1 CPU = 1 vCPU/core (AWS vCPU, GCP core, Azure vCore, physical core). Fractional cores: 1000m = 1 CPU, 500m = 0.5 CPU, 100m = 0.1 CPU, 1m = 0.001 CPU (minimum). Alternative syntax: 0.5 = 500m, 2 = 2000m. Memory: Binary units (1Ki = 1024 bytes, 1Mi = 1024 Ki, 1Gi = 1024 Mi) vs decimal (1k = 1000 bytes, 1M = 1000k, 1G = 1000M). Prefer binary (Mi, Gi) for consistency. Examples: 128Mi, 1.5Gi, 512Mi, 4Gi. Fractional quantities such as 0.5Gi (= 512Mi) are accepted, but whole Mi values are usually easier to read.
QoS (Quality of Service) Classes: Kubernetes assigns QoS based on requests/limits, determines eviction order during node pressure (memory/disk shortage). (1) Guaranteed (highest priority): All containers have requests = limits for CPU AND memory. Pod gets guaranteed resources, lowest eviction priority (evicted last). Example: Container with requests.cpu: 1, limits.cpu: 1, requests.memory: 1Gi, limits.memory: 1Gi → QoS: Guaranteed. Use case: Critical production services (databases, payment systems, control plane pods). (2) Burstable (medium priority): At least one container has requests < limits OR requests set but not limits. Pod can burst above requests (use idle resources), medium eviction priority (evicted before Guaranteed, after BestEffort). Example: Container with requests.cpu: 500m, limits.cpu: 2, requests.memory: 512Mi, limits.memory: 2Gi → QoS: Burstable. Use case: Most production workloads (APIs, web servers, workers - allow bursting during traffic spikes). (3) BestEffort (lowest priority): No requests or limits set for any container. Pod uses whatever resources available, highest eviction priority (evicted first during node pressure). Example: Container with no resources specified → QoS: BestEffort. Use case: Batch jobs, non-critical workloads, development/testing (NOT production services). Production best practices (2025): (1) Always set requests: Enables proper scheduling (prevents pod starvation on overloaded nodes), establishes resource accounting baseline. Omitting requests → BestEffort QoS → first to be evicted. (2) Set limits to prevent resource exhaustion: CPU limit prevents single container monopolizing node CPU (noisy neighbor problem). Memory limit prevents OOMKilled affecting other pods (kernel OOM killer may kill random processes on node without cgroup limits). (3) Requests < Limits for burstable capacity: Allows scaling beyond baseline during traffic spikes. Example: requests.cpu: 500m, limits.cpu: 2 (baseline 0.5 core, burst to 2 cores when available). Avoid requests = limits (wastes resources, prevents efficient bin packing). (4) Profile actual usage: Use kubectl top pod --containers or Prometheus metrics (container_cpu_usage_seconds_total, container_memory_working_set_bytes) to right-size requests/limits. Set requests to p50 usage, limits to p99 usage (covers 99% of traffic without overprovisioning). (5) CPU: Generous limits or no limit: CPU throttling causes hard-to-debug performance issues (application appears slow for no obvious reason). Either set high limit (2x-4x request) or omit limit entirely (allow bursting to node capacity). Example: requests: 500m, limits: 2000m OR requests: 500m, no limit. (6) Memory: Always set limits: Memory leak without limit → node runs out of memory → kernel OOM killer kills random pods. Set limit to realistic maximum (observed peak + 20% buffer). Example: Observed peak 800Mi → set limit 1Gi. (7) HPA compatibility: Set requests for HPA to work (HPA scales based on % of CPU/memory request, not absolute usage). HPA target: 70% CPU utilization means 70% of requested CPU (700m usage if request 1000m). Common pitfalls (2025): (1) No requests set: Pod scheduled to node with 0.1 CPU available → terrible performance even though pod needs 2 CPU. Scheduler assumes 0 resources needed. Fix: Set realistic requests based on profiling. (2) Requests too high: Over-requesting causes resource waste (nodes appear full but underutilized). Example: Request 4 CPU but use 0.5 CPU average → 87.5% waste. 
Fix: Right-size to actual usage (p50-p75 percentile). (3) CPU throttling due to low limits: Limit 500m but app needs 1 CPU during peak → severe throttling (200% slowdown). Users report timeouts, degraded performance. Fix: Increase limit or remove (allow bursting). Monitor container_cpu_cfs_throttled_seconds_total metric. (4) Memory limit too low: Limit 512Mi but app peak usage 800Mi → frequent OOMKilled → CrashLoopBackOff → service outage. Fix: Increase limit to observed peak + buffer, investigate memory leaks if usage unbounded. (5) No limits on untrusted workloads: Multi-tenant cluster without limits → tenant A consumes all node resources → tenant B pods evicted. Fix: Enforce LimitRange (default limits per namespace). LimitRange for namespace defaults: Administrators set default requests/limits via LimitRange resource (prevents users forgetting to set resources). Example LimitRange: default CPU request 100m, default CPU limit 500m, default memory request 128Mi, default memory limit 512Mi, max per container 4 CPU / 8Gi memory. Applies to pods created without explicit resources. ResourceQuota for namespace limits: Caps total resource consumption per namespace. Example: namespace dev has quota 10 CPU request, 20 CPU limit, 20Gi memory request, 40Gi memory limit. Prevents single namespace exhausting cluster. New pod creation fails if quota exceeded: Error: exceeded quota: compute-quota, requested: limits.memory=2Gi, used: limits.memory=38Gi, limited: limits.memory=40Gi. Node capacity vs allocatable: Node has total capacity (16 CPU, 64Gi memory) but reserves resources for kubelet, OS, eviction thresholds. Allocatable = capacity - reserved (example: 15.5 CPU, 60Gi memory allocatable). Scheduler uses allocatable for placement decisions. Check with kubectl describe node shows Capacity and Allocatable sections. Overcommitment strategies: (1) Conservative (requests = limits): No overcommitment, 1:1 resource guarantee, lower utilization (40-60% typical), higher cost, maximum reliability. (2) Moderate (requests < limits, 2:1 ratio): Limit = 2x request, allows bursting, 60-80% utilization, good balance for production. (3) Aggressive (no limits or 4:1 ratio): Maximize utilization (80%+), risk of resource contention, requires careful monitoring, suitable for batch/non-critical workloads. Monitoring and optimization: kubectl top pod --containers shows current CPU/memory usage vs requests (CPU% = usage / request). Prometheus queries: container_cpu_usage_seconds_total (CPU usage), container_memory_working_set_bytes (memory usage), container_cpu_cfs_throttled_seconds_total (throttling), kube_pod_container_resource_requests (requested resources), kube_pod_container_resource_limits (limits). VPA (Vertical Pod Autoscaler) recommends optimal requests/limits based on historical usage (auto-updates requests in Deployment). Cost optimization: Right-sizing reduces cloud costs (AWS EKS: 10 nodes @ 40% utilization vs 6 nodes @ 70% utilization saves 40% EC2 costs). Use VPA or manual profiling to eliminate over-provisioning. Troubleshooting: Pod Pending with Insufficient cpu/memory → increase node capacity or reduce pod requests. Pod OOMKilled frequently → increase memory limit or fix memory leak. High CPU throttling → increase CPU limit or optimize app. kubectl describe pod shows Requests and Limits in Containers section.
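A hedged sketch of a Burstable container spec plus a namespace LimitRange (names, namespace, and sizes are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
  - name: api
    image: registry.example.com/api:1.0
    resources:
      requests:
        cpu: 500m          # scheduling guarantee, roughly p50 usage
        memory: 512Mi
      limits:
        cpu: "2"           # generous CPU limit to keep throttling rare
        memory: 1Gi        # hard cap; exceeding it triggers OOMKill
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: dev
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:               # applied when a container omits limits
      cpu: 500m
      memory: 512Mi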

99% confidence
A

Taints and Tolerations: scheduling mechanism (the inverse of node affinity) by which nodes repel pods unless the pods tolerate the taint. Taints applied to nodes, tolerations in pod spec. Node taint structure: key=value:effect. Three effects: (1) NoSchedule - don't schedule new pods without matching toleration (existing pods unaffected). (2) PreferNoSchedule - soft NoSchedule, scheduler tries to avoid but not guaranteed. (3) NoExecute - evict existing pods without toleration AND don't schedule new pods (hard eviction with grace period). Toleration syntax in pod spec: tolerations[].key, operator (Equal/Exists), value, effect, tolerationSeconds (for NoExecute, delays eviction). Use cases: (1) Dedicated nodes - GPU nodes (gpu=true:NoSchedule), high-memory workloads (memory=high:NoSchedule). (2) Node maintenance - taint nodes before drain (node.kubernetes.io/unschedulable:NoSchedule added by kubectl cordon). (3) Node failure isolation - automatic taints (node.kubernetes.io/not-ready:NoExecute, node.kubernetes.io/unreachable:NoExecute with 300s toleration default). (4) Special hardware - FPGA, InfiniBand networking. Example: kubectl taint nodes node1 gpu=nvidia-a100:NoSchedule. Pod spec tolerates: key: gpu, operator: Equal, value: nvidia-a100, effect: NoSchedule (see the sketch below). Special toleration: operator: Exists (tolerates any value of the given key; with no key specified, tolerates all taints). Built-in taints: node.kubernetes.io/disk-pressure, node.kubernetes.io/memory-pressure, node.kubernetes.io/pid-pressure. Production pattern: taint specialized nodes, DaemonSets use Exists tolerations to run on ALL nodes including tainted ones.
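A minimal sketch, assuming an illustrative node name and a gpu=nvidia-a100 taint plus matching node label; the toleration only permits scheduling onto the tainted node, so it is paired with a nodeSelector to actually attract the pod there:

kubectl taint nodes node1 gpu=nvidia-a100:NoSchedule
kubectl label nodes node1 gpu=nvidia-a100

# pod .spec fragment
tolerations:
- key: gpu
  operator: Equal
  value: nvidia-a100
  effect: NoSchedule
nodeSelector:
  gpu: nvidia-a100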

99% confidence
A

NetworkPolicy: namespace-scoped resource specifying how groups of pods communicate with each other and external endpoints (pod-level firewall). Requires CNI plugin with NetworkPolicy support (Calico, Cilium, Weave Net, Canal). Default behavior: pods non-isolated, accept traffic from any source. Once NetworkPolicy selects pod, pod becomes isolated and only allows traffic matching NetworkPolicy rules. Key components: (1) podSelector - selects pods NetworkPolicy applies to (label selector, empty {} selects all pods in namespace). (2) policyTypes - [Ingress, Egress] or both. (3) ingress rules - allowed inbound traffic sources: podSelector (pods in same namespace), namespaceSelector (pods in selected namespaces), ipBlock (CIDR ranges like 172.17.0.0/16, except: [172.17.1.0/24]). (4) egress rules - allowed outbound destinations (same selectors). (5) ports - protocol (TCP/UDP/SCTP) + port number. Policies additive: pod matched by multiple NetworkPolicies allows union of all rules. Best practice patterns: (1) Default deny all: podSelector: {}, policyTypes: [Ingress, Egress] (no rules = deny all). (2) Allow DNS: egress to kube-dns on port 53 UDP. (3) Allow specific namespace: namespaceSelector: {matchLabels: {name: frontend}}. Example production policy: allow frontend pods to access backend on port 8080 only. Security impact: isolates microservices, implements zero-trust networking, prevents lateral movement. Limitations: no layer 7 (HTTP path) filtering, no DNS-based egress rules (must use IPs/CIDRs), performance overhead minimal (<5% with Cilium eBPF). Essential for multi-tenant clusters and PCI/HIPAA compliance.
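A hedged policy sketch for the frontend-to-backend example above (the namespace and tier labels are assumptions; k8s-app=kube-dns is the conventional label on CoreDNS pods):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      tier: backend
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:                                  # allow DNS lookups to kube-dns in any namespace
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53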

99% confidence
A

Affinity/Anti-Affinity: advanced pod scheduling constraints using label selectors and topology. Three types: (1) Node Affinity - attracts pods to nodes (enhanced nodeSelector). requiredDuringSchedulingIgnoredDuringExecution (hard constraint, must match), preferredDuringSchedulingIgnoredDuringExecution (soft constraint, weight 1-100, scheduler tries to match). Uses node labels (disktype=ssd, gpu=nvidia-a100, instance-type=m5.xlarge). (2) Pod Affinity - schedules pods together on same topology domain (node, zone, region). Example: schedule web pods on same nodes as cache pods for low latency. (3) Pod Anti-Affinity - spreads pods across topology domains for HA. Example: spread replicas across zones (topologyKey: topology.kubernetes.io/zone) to survive zone failure. Topology keys: kubernetes.io/hostname (node-level), topology.kubernetes.io/zone (zone-level), topology.kubernetes.io/region (region-level). Label selectors: matchExpressions with operators (In, NotIn, Exists, DoesNotExist). Use cases: (1) Node affinity - GPU workloads (gpu=true), SSD storage (disktype=ssd), spot instances vs on-demand. (2) Pod affinity - co-locate frontend with Redis cache, microservices needing low-latency communication. (3) Pod anti-affinity - HA replicas across zones, avoid single point of failure. Example anti-affinity: replicas spread using labelSelector: {app: web}, topologyKey: kubernetes.io/hostname (each pod on different node; see the fragment below). Performance impact: pod anti-affinity expensive at scale (>1000 pods), scheduler evaluates all nodes - use preferredDuringScheduling for large clusters. Production best practice: combine with PodDisruptionBudget for HA (ensure minimum replicas during node drain/upgrades).
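A hedged pod-template fragment showing both patterns (the app: web and app: redis-cache labels are assumptions):

# pod template .spec.affinity fragment
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web
      topologyKey: kubernetes.io/hostname        # at most one web replica per node
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: redis-cache
        topologyKey: kubernetes.io/hostname      # prefer nodes already running the cache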

99% confidence
A

kubectl drain (2025): Safely evict all pods from node before maintenance, essential for cluster operations without service disruption. Purpose: Gracefully migrate workloads off node, respect application availability requirements (PodDisruptionBudgets), prepare node for downtime (OS patching, hardware replacement, kernel upgrades, node decommissioning). Drain workflow: (1) Marks node unschedulable (cordon) - new pods cannot be scheduled to node (spec.unschedulable: true). (2) Evicts all pods from node - sends termination signal (SIGTERM), waits for graceful shutdown (respects terminationGracePeriodSeconds, default 30s), force-kills pods after grace period (SIGKILL). (3) Pods rescheduled to other nodes - controllers (Deployment, StatefulSet, DaemonSet) recreate pods on available nodes. Syntax: kubectl drain [flags]. Example: kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data. Key flags: (1) --ignore-daemonsets: Skip DaemonSet-managed pods (required, DaemonSet pods tied to nodes, cannot be drained). Without flag: error - cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore). DaemonSet pods remain running on drained node (node monitoring agents, log collectors, CNI plugins must stay active for proper shutdown). (2) --delete-emptydir-data: Allow deletion of pods using emptyDir volumes (ephemeral data lost). Without flag: error if pods have emptyDir - cannot delete Pods with local storage. Use with caution (verify data is non-critical or backed up elsewhere). Example: Temporary cache, build artifacts, non-persistent state. (3) --force: Delete pods not managed by ReplicationController/ReplicaSet/Job/DaemonSet/StatefulSet (standalone pods, orphaned pods). Without flag: error - cannot delete pods not managed by controller. Force required for one-off pods created with kubectl run. Warning: Standalone pods NOT recreated after deletion (permanent data loss). (4) --grace-period: Override pod's terminationGracePeriodSeconds. Example: --grace-period=120 waits 2 minutes before force-kill (allows long-running requests to complete). --grace-period=0 immediate termination (dangerous, use only for hung pods). Default: respects pod's configured grace period (typically 30s). (5) --timeout: Total drain timeout (default 0 = infinite). Example: --timeout=5m fails drain if not completed in 5 minutes (prevents indefinite wait on stuck pods). Useful for automated workflows (CI/CD pipelines, autoscaling). (6) --pod-selector: Drain only pods matching label selector. Example: --pod-selector='app!=critical' drains non-critical pods only (phased maintenance). Pod eviction order: (1) BestEffort QoS pods evicted first (no resource requests/limits). (2) Burstable QoS pods evicted second (requests < limits). (3) Guaranteed QoS pods evicted last (requests = limits, critical workloads). Within same QoS: lower priority pods evicted first (spec.priorityClassName). PodDisruptionBudget (PDB) interaction: Drain respects PDBs to maintain application availability. Example PDB: minAvailable: 2 for Deployment with 3 replicas. Drain attempts eviction → checks PDB → only evicts if 2+ replicas remain healthy across cluster → waits for new pod to become Ready before evicting next → ensures continuous availability. PDB blocking drain: If PDB cannot be satisfied (too few healthy replicas), drain pauses with error: Cannot evict pod as it would violate the pod's disruption budget. Resolution: Scale up replicas, relax PDB (lower minAvailable), or wait for unhealthy pods to recover. 
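A minimal PodDisruptionBudget sketch for the interaction described above (the app: web label and threshold are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2            # evictions during drain pause if fewer than 2 web pods would remain Ready
  selector:
    matchLabels:
      app: web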
kubectl cordon (Preventive measure): Marks node unschedulable without evicting existing pods (spec.unschedulable: true). Syntax: kubectl cordon <node-name>. Use cases: (1) Gradual workload migration - cordon first, wait for natural pod churn (deployments update, pods restart), then drain remaining pods. (2) Prevent new pods during investigation - cordon unstable node to avoid new workload placement while debugging. (3) Temporary capacity reduction - cordon excess nodes during low-traffic periods. Effect: Existing pods continue running (no disruption), new pods schedule to other nodes. kubectl get nodes shows SchedulingDisabled status for cordoned nodes. kubectl uncordon (Resume scheduling): Marks node schedulable after maintenance (spec.unschedulable: false). Syntax: kubectl uncordon <node-name>. Effect: Node becomes eligible for pod scheduling again, existing pods NOT automatically rebalanced (use descheduler for rebalancing). Workflow: kubectl drain node-1 → perform maintenance (reboot, OS patch, hardware fix) → node rejoins cluster → kubectl uncordon node-1 → new pods can schedule to node-1. Production use cases (2025): (1) OS patching: kubectl drain node-1 --ignore-daemonsets → ssh node-1 'sudo apt-get update && sudo apt-get upgrade -y && sudo reboot' → wait for node ready → kubectl uncordon node-1. Repeat for all nodes (rolling update pattern). (2) Kubernetes version upgrade: Drain control plane nodes one at a time → upgrade kubelet/API server → uncordon. Then drain worker nodes → upgrade kubelet → uncordon. Ensures control plane availability (3 control plane nodes, drain 1 at a time maintains quorum). (3) Hardware replacement: Drain node with failing disk → physically replace hardware → reconfigure node → kubectl uncordon. Prevents data loss (workloads migrated before hardware failure). (4) Node scaling down: Drain node before removing from cluster (AWS ASG termination lifecycle hook calls drain script before instance termination). Prevents abrupt pod termination (graceful migration vs hard kill). (5) Cluster rebalancing: Drain overloaded nodes to redistribute pods (combine with pod anti-affinity rules for optimal spread). Best practices (2025): (1) Always drain before node downtime: Never reboot/shutdown node without draining (causes pod unavailability until the failure is detected and pods are rescheduled, plus potential data loss for StatefulSets). (2) Use PodDisruptionBudgets: Define PDBs for critical apps to enforce minimum availability during drains (example: PDB with minAvailable: 51% ensures majority of pods stay running). (3) Drain control plane nodes one at a time: Never drain multiple control plane nodes simultaneously (requires etcd quorum, 3-node cluster needs 2 nodes for write operations). (4) Monitor drain progress: kubectl get pods -o wide --watch shows pod migration in real-time (pods transition Terminating → new pods appear on other nodes → Running). (5) Timeout for automation: Use --timeout=10m in automated scripts (prevents infinite hangs on stuck pods, fails fast for retry logic). (6) Backup before drain: For StatefulSets with persistent data (databases), verify recent backups before draining (protects against pod scheduling failures, PV attachment issues). Common issues (2025): (1) Drain hangs indefinitely: Pod stuck in Terminating state (finalizers not completing, volume detach stuck). Resolution: Investigate stuck pod (kubectl describe pod <pod-name>), force-delete if necessary (kubectl delete pod <pod-name> --force --grace-period=0), check CSI driver logs for volume issues.
(2) PDB blocks drain: Error: Cannot evict pod as it would violate disruption budget. Drain cannot proceed (not enough healthy replicas to satisfy PDB). Resolution: Scale up Deployment (kubectl scale deployment myapp --replicas=5), temporarily delete PDB (risky), or wait for pods to become Ready. (3) DaemonSet pods prevent drain: Forgot --ignore-daemonsets flag → drain fails with error. Resolution: Always use --ignore-daemonsets (DaemonSet pods are node-specific, expected to run on drained node). (4) Local data loss: Pods with emptyDir volumes evicted → data lost. Resolution: Use --delete-emptydir-data only after confirming data is non-critical. Migrate critical data to PersistentVolumes before draining. Drain vs delete node: kubectl drain (graceful eviction, pods rescheduled, respects PDBs, safe for production) vs kubectl delete node (forceful removal, pods hard-killed, no PDB respect, use only for failed/unreachable nodes). Always prefer drain for planned maintenance. Automated drain with cluster-autoscaler: Cluster-autoscaler automatically drains nodes before scale-down (removes underutilized nodes). Respects PDBs, uses --grace-period, triggers only when safe (no PDB violations). Configure via autoscaler flags: scale-down-delay-after-add (wait 10m after node add before scale-down), scale-down-unneeded-time (node idle for 10m triggers drain + removal). Drain duration estimation: Typical drain: 30s-2m (depends on pod count, grace periods, PDB constraints). Slow drain: 5-15m (large StatefulSets with slow volume detach, strict PDBs limiting parallelism). Troubleshooting slow drains: Check events (kubectl get events --field-selector involvedObject.name=<node-name>), verify PDB status (kubectl get pdb), inspect stuck pods (kubectl describe pod <pod-name>). Kubernetes version compatibility: kubectl drain syntax stable since Kubernetes 1.5, --pod-selector added in 1.21, PDB v1 (stable) since 1.21. Use kubectl version to verify client/server compatibility (keep kubectl within one minor version of the cluster per the supported skew policy).
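A hedged command sequence for the rolling node-maintenance pattern described above (node-1 is illustrative; repeat per node):

kubectl cordon node-1                                                           # stop new pods landing on the node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data --timeout=10m   # evict workloads, respecting PDBs
# ...perform maintenance (patching, reboot, hardware work)...
kubectl get nodes --watch                                                       # wait for node-1 to report Ready
kubectl uncordon node-1                                                         # allow scheduling again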

99% confidence
A

Init containers: specialized containers running before app containers in pod, must complete successfully (exit 0) before app containers start. Run sequentially in spec.initContainers order (wait for each to finish before next). Lifecycle: init container 1 runs to completion → init container 2 runs → ... → all app containers start in parallel. Restart behavior: if init container fails, kubelet restarts entire pod (unless restartPolicy: Never). Shared resources: share pod volumes, network namespace (localhost communication), security context with app containers. Different images: use specialized utilities not needed in app container (curl, git, database CLI tools). Use cases: (1) Wait for dependencies - loop until database/service ready (while ! nc -z db 5432; do sleep 1; done). (2) Database migrations - run schema migrations before app starts (liquibase, flyway, django migrate). (3) Git clone - fetch code/config from repository into shared volume. (4) Security setup - fetch secrets from Vault, generate TLS certificates, set file permissions (chmod, chown on volumes). (5) Configuration generation - template config files, fetch remote config. (6) Pre-populate data - download datasets, seed caches. Example: init container with busybox:1.36 runs sh -c 'until nslookup mydb; do sleep 2; done', app container starts only after mydb DNS resolves. Resource limits: init containers can have different resource requests/limits than app containers (higher limits for one-time tasks). Probes: liveness/readiness probes don't apply to init containers (they run to completion). Production pattern: database-dependent app uses init container to wait for DB readiness, ensures app doesn't crash on startup. Security benefit: separation of privileges (init container runs as root for setup, app container runs as non-root user).
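A hedged pod sketch of the wait-for-dependency pattern (the mydb host, port 5432, and migration image/command are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  initContainers:
  - name: wait-for-db                      # blocks app startup until the database answers
    image: busybox:1.36
    command: ["sh", "-c", "until nc -z mydb 5432; do echo waiting for mydb; sleep 2; done"]
  - name: run-migrations                   # illustrative one-time setup step
    image: registry.example.com/migrator:1.0
    command: ["./migrate", "--target", "latest"]
  containers:
  - name: app
    image: registry.example.com/web:1.0
    ports:
    - containerPort: 8080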

99% confidence
A

Kubernetes API Server (kube-apiserver, 2025): Central control plane component serving as front-end for Kubernetes cluster, exposing RESTful HTTP API for all cluster operations (create/read/update/delete resources). Core responsibilities: (1) API Gateway: All cluster interactions go through API server (kubectl commands, controller managers, scheduler, kubelet, custom controllers, external clients). Single entry point for cluster state modifications. Exposes API groups (core/v1 for pods/services, apps/v1 for deployments, batch/v1 for jobs). API discovery via kubectl api-resources (lists all available resource types + API groups). (2) Authentication: Verifies client identity before processing requests. Methods: X.509 client certificates (kubectl uses ~/.kube/config with client-cert-data), bearer tokens (ServiceAccount tokens in pods), OpenID Connect (OIDC for SSO integration with Google/Azure AD/Okta), webhook token authentication (custom auth backends). Request flow: Client sends request with credentials → API server validates credentials → extracts user identity (username, UID, groups) → passes to authorization. Failed auth → HTTP 401 Unauthorized. (3) Authorization: Determines if authenticated user can perform requested operation. Modes: RBAC (Role-Based Access Control, default and recommended), Node authorization (kubelets access only their node resources), Webhook (external authorization service), ABAC (Attribute-Based, deprecated). RBAC flow: User requests kubectl get pods -n production → API server checks RoleBindings/ClusterRoleBindings → verifies user has get permission for pods resource in production namespace → allow/deny. Failed authz → HTTP 403 Forbidden. (4) Admission Control: Validates and mutates requests after authentication/authorization, before persisting to etcd. Two phases: Mutating admission (modifies request, examples: DefaultStorageClass adds default storage class to PVC, ServiceAccount injects default SA token into pods) → Validating admission (accepts/rejects request, examples: ResourceQuota enforces namespace quotas, PodSecurityAdmission enforces security standards). Built-in admission controllers (30+ including NamespaceLifecycle, LimitRanger, PersistentVolumeClaimResize, ValidatingAdmissionWebhook, MutatingAdmissionWebhook). Custom admission: ValidatingWebhookConfiguration and MutatingWebhookConfiguration for external policy enforcement (OPA, Kyverno, Falco). Failed admission → HTTP 400/403 with reason. (5) API Object Validation: Validates resource manifests against OpenAPI schema (ensures required fields present, correct types, valid values). Example: Deployment requires spec.selector matching template labels, rejects invalid image names, enforces API version compatibility. Schema validation prevents malformed resources from reaching etcd. (6) Persistence to etcd: Only component with direct etcd access (read/write cluster state). API server serializes objects to JSON/Protobuf → writes to etcd with optimistic concurrency control (resourceVersion prevents lost updates). All cluster state in etcd (pods, services, secrets, configmaps, resource quotas, RBAC policies). etcd failure → cluster read-only (API server serves cached data, no writes accepted). (7) Watch Mechanism: Clients subscribe to resource changes via HTTP long-polling (kubectl get pods --watch, controller watch loops). API server notifies clients of ADDED/MODIFIED/DELETED events in real-time. Enables reactive controllers (Deployment controller watches ReplicaSets, creates pods when scaled up). 
Watches efficient (HTTP/2 multiplexing, bookmark events prevent full re-list). (8) API Versioning: Manages multiple API versions simultaneously (v1alpha1, v1beta1, v1 stable). Allows deprecation without breaking existing clients (apps/v1beta1 Deployment → apps/v1 gradual migration). Conversion webhooks translate between versions (external CRD versioning). API Server Architecture (2025): (1) Stateless: No local state (all state in etcd), enables horizontal scaling (run 3+ API servers for HA). Requests load-balanced via cloud LB or kube-vip/HAProxy. (2) RESTful API: Standard HTTP verbs (GET, POST, PUT, PATCH, DELETE), resources addressable via URLs (/api/v1/namespaces/default/pods/nginx, /apis/apps/v1/namespaces/production/deployments). Content negotiation (accepts application/json, application/yaml, application/vnd.kubernetes.protobuf). (3) OpenAPI Spec: Self-documenting API via /openapi/v2 endpoint (Swagger/OpenAPI 2.0) and /openapi/v3 (OpenAPI 3.0, Kubernetes 1.24+). Used by kubectl explain, client generators, API discovery tools. (4) HTTP/2 and gRPC: Supports HTTP/2 for efficient multiplexing (single connection for multiple watches), gRPC for high-performance internal communication (kubelet → API server pod logs streaming). High Availability (2025): Production clusters run 3+ API server replicas across availability zones (odd number for etcd quorum). Load balancer distributes requests (AWS NLB, GCP GLB, Azure LB, or software LB like kube-vip). API server failure: Other replicas handle requests (stateless design allows instant failover), kubectl retries on connection failure, controllers re-establish watches. etcd requires majority quorum (3-node etcd needs 2 healthy nodes, 5-node needs 3). Performance & Scalability (2025): API server handles 1000s QPS (queries per second) in large clusters. Optimizations: (1) Caching: In-memory cache reduces etcd load (watch cache for list operations, reduces full scans), TTL-based cache invalidation. (2) Priority and Fairness (APF): Kubernetes 1.20+ feature prevents API server overload via request queuing. Classifies requests into priority levels (system-leader-election, workload-high, workload-low, catch-all), enforces per-priority-level concurrency limits and queuing. Protects API server from thundering herd (mass pod creation doesn't starve critical requests like leader election). (3) Pagination: List operations support pagination (limit/continue tokens), prevents massive responses overwhelming clients (kubectl get pods --limit=500 returns 500 pods + continue token for next batch). (4) Field selectors: Filter resources server-side (kubectl get pods --field-selector status.phase=Running) reduces network transfer. API Server Flags (Key configurations): (1) --etcd-servers: Comma-separated etcd endpoints (https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379). (2) --authorization-mode: Authorization plugins (RBAC,Node default for secure clusters). (3) --enable-admission-plugins: Enabled admission controllers (NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota,MutatingAdmissionWebhook,ValidatingAdmissionWebhook). (4) --service-cluster-ip-range: CIDR for Service ClusterIPs (10.96.0.0/12 typical). (5) --secure-port: HTTPS port for API server (6443 default, was 443 in older versions). (6) --tls-cert-file, --tls-private-key-file: TLS certificate for HTTPS. (7) --client-ca-file: CA bundle for client certificate authentication. 
API Server Endpoints (2025): https://<apiserver-host>:6443/api/v1 (core API group: pods, services, namespaces, configmaps, secrets, nodes, persistentvolumes), https://<apiserver-host>:6443/apis (API groups: apps/v1 for deployments/statefulsets, batch/v1 for jobs/cronjobs, networking.k8s.io/v1 for ingresses/networkpolicies). Healthz endpoints: /livez (liveness), /readyz (readiness), /healthz (overall health, deprecated). Metrics: /metrics (Prometheus format, API server performance metrics). Security hardening (2025): (1) Disable anonymous auth (--anonymous-auth=false), (2) Enable audit logging (--audit-log-path, --audit-policy-file for tracking all API requests), (3) Restrict API access with NetworkPolicies (only allow control plane → API server traffic), (4) Use OIDC for human users (avoid long-lived kubeconfig certs), (5) Enable encryption at rest (--encryption-provider-config encrypts Secrets in etcd). Troubleshooting: API server unreachable: Check kubectl cluster-info (shows API server URL), verify DNS resolution (nslookup kubernetes.default.svc.cluster.local from pod), test connectivity (curl -k https://<apiserver-host>:6443/livez). High latency: Check API server metrics (apiserver_request_duration_seconds_bucket shows p99 latency), review audit logs for expensive queries (large LIST operations), enable APF (prevent overload). etcd issues: API server logs show etcdserver: request timed out → check etcd health (etcdctl endpoint health), verify etcd performance (disk I/O, network latency < 10ms for healthy etcd). Critical nature: API server down → cluster control plane unusable (cannot create/update/delete resources, existing workloads continue running but unmanageable, kubectl commands fail, controllers cannot reconcile state). Always run 3+ replicas for production HA.
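A few read-only commands that exercise the paths described above (the namespace in the RBAC check is illustrative):

kubectl cluster-info                                   # API server URL in use
kubectl api-resources                                  # discover resource types and API groups
kubectl get --raw '/readyz?verbose'                    # readiness of API server subsystems
kubectl get --raw '/livez'                             # liveness endpoint
kubectl auth can-i create deployments -n production    # exercise the RBAC authorization path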

99% confidence
A

Kubernetes Labels (2025): Key-value pairs attached to Kubernetes objects (pods, services, nodes, deployments, namespaces, pvs) for organization, selection, and grouping. Fundamental mechanism for loosely coupling resources in Kubernetes (Services find pods, Deployments manage pods, kubectl filters resources). Label characteristics: (1) Non-unique: Multiple objects can have same label (all production pods: env=prod), single object can have multiple labels (pod has env=prod, tier=frontend, version=v2.1, team=platform). (2) Metadata only: Labels don't directly affect object behavior (unlike annotations which can trigger controller actions), used solely for selection and organization. (3) Key constraints: Valid label key format: [prefix/]name. Prefix optional (kubernetes.io/, example.com/, max 253 chars), name required (max 63 chars, alphanumeric + dash/underscore/dot, must start/end with alphanumeric). Examples: app, tier, environment, kubernetes.io/arch, example.com/team. (4) Value constraints: Max 63 chars, alphanumeric + dash/underscore/dot, can be empty string. Examples: frontend, v1.2.3, prod, team-platform, empty. Common label patterns (2025): (1) Environment: env=prod, env=staging, env=dev (separate environments in same cluster). (2) Application tier: tier=frontend, tier=backend, tier=database, tier=cache (microservice architecture). (3) Application name: app=nginx, app=mysql, app=redis (identify application type). (4) Version: version=v1.2.3, version=canary, version=stable (A/B testing, canary deployments). (5) Team ownership: team=platform, team=data, team=ml (multi-tenant clusters). (6) Cost center: cost-center=engineering, cost-center=marketing (chargeback/showback). (7) Release: release=2025-01, release=stable, release=beta (track release cohorts). Kubernetes-managed labels: (1) kubernetes.io/hostname: node-1 (node hostname, auto-applied to nodes). (2) topology.kubernetes.io/zone: us-east-1a (availability zone, for zone-aware scheduling). (3) topology.kubernetes.io/region: us-east-1 (cloud region). (4) kubernetes.io/arch: amd64 (CPU architecture: amd64, arm64). (5) kubernetes.io/os: linux (operating system: linux, windows). (6) node.kubernetes.io/instance-type: m5.xlarge (cloud instance type). Selectors (2025): Query language for filtering resources by labels. Two types: equality-based and set-based. (1) Equality-based selectors: Matches exact key-value pairs or inequality. Syntax: key=value (equals), key!=value (not equals). Examples: env=prod (selects objects with env label = prod), tier!=frontend (selects objects where tier label ≠ frontend or tier label absent). Multiple conditions: env=prod,tier=frontend (AND logic, both conditions must match). kubectl usage: kubectl get pods -l env=prod (list pods with env=prod label), kubectl get pods -l env=prod,tier=backend (list pods matching both labels). (2) Set-based selectors: Matches labels using set operations (in, notin, exists). Syntax: key in (value1,value2) (label value in set), key notin (value1,value2) (label value not in set or absent), key (label exists, any value), !key (label does not exist). Examples: env in (prod,staging) (selects prod OR staging), tier notin (frontend,backend) (selects objects where tier not frontend/backend or tier absent), app (selects objects with app label, any value), !deprecated (selects objects without deprecated label). 
kubectl usage: kubectl get pods -l 'env in (prod,staging)' (quotes required for shell parsing), kubectl get pods -l 'tier notin (frontend),env=prod' (set-based + equality-based combined). Selector usage in Kubernetes resources: (1) Services: Select pods to route traffic. Service spec.selector: app=nginx, tier=frontend. Service creates endpoints for all pods matching labels. Dynamic pod discovery (new pods with matching labels automatically added to Service, deleted pods removed). Example: Service nginx-service with selector app=nginx routes traffic to all pods labeled app=nginx (3 pods initially, scale to 10 → Service automatically routes to all 10). (2) Deployments: Manage pod replicas. Deployment spec.selector.matchLabels: app=web, env=prod. Deployment creates/deletes pods to maintain desired replica count for matching labels. Rolling updates: Deployment creates new ReplicaSet with updated pod template labels (version=v2), gradually scales new pods up and old pods down. (3) ReplicaSets: Maintain pod count. ReplicaSet spec.selector matches pods, creates new pods when count falls below desired replicas. Selector immutable after creation (cannot change selector on existing ReplicaSet, must delete and recreate). (4) Jobs: Track job completion. Job spec.selector matches pods created by job, counts successful completions. (5) NetworkPolicies: Allow/deny traffic based on pod labels. NetworkPolicy podSelector: tier=backend allows ingress from pods with tier=frontend (label-based micro-segmentation). (6) Node affinity: Schedule pods to nodes with specific labels. Pod spec.affinity.nodeAffinity.requiredDuringScheduling matchExpressions: key=disktype, operator=In, values=[ssd] (schedule only to nodes labeled disktype=ssd). Production best practices (2025): (1) Consistent labeling scheme: Define organizational standard (env, app, version, team mandatory for all resources). Document in runbooks, enforce via ValidatingAdmissionWebhook or OPA policies. (2) Hierarchical labels: Use prefixes for grouping (example.com/app=nginx, example.com/team=platform keeps related labels together). (3) Version tracking: Always include version label (version=v1.2.3 enables canary deployments, gradual rollouts, rollback). (4) Avoid sensitive data: Labels visible in kubectl output, API responses, audit logs (don't use customer-id=12345, use cost-center=team-alpha instead). (5) Label everything: Pods, Services, Deployments, PVCs, ConfigMaps, Secrets (enables cross-resource queries like kubectl get all -l app=nginx shows all resources for app). (6) Set-based selectors for flexibility: Use set-based in NetworkPolicies, complex queries (env in (prod,staging) more flexible than separate env=prod, env=staging rules). Recommended labels (Kubernetes documentation): app.kubernetes.io/name: mysql (application name), app.kubernetes.io/instance: mysql-abcxzy (unique instance ID), app.kubernetes.io/version: 5.7.21 (application version), app.kubernetes.io/component: database (architecture tier), app.kubernetes.io/part-of: wordpress (parent application), app.kubernetes.io/managed-by: helm (management tool). Standardized labels enable tooling integration (Helm, Kustomize, monitoring systems recognize these labels). kubectl label commands: Add label: kubectl label pods nginx-pod env=prod (adds env=prod to pod nginx-pod). Update label: kubectl label pods nginx-pod env=staging --overwrite (changes env from prod to staging). Remove label: kubectl label pods nginx-pod env- (trailing dash removes label). 
Label multiple resources: kubectl label pods -l app=nginx tier=frontend (adds tier=frontend to all pods with app=nginx). Selectors in YAML manifests: Service selector (equality-based only): selector: app=nginx, tier=frontend. Deployment selector (supports set-based): matchLabels: app=nginx (equality), matchExpressions: key=tier, operator=In, values=[frontend,backend] (set-based). NetworkPolicy selector: podSelector.matchLabels: role=db, namespaceSelector.matchExpressions: key=env, operator=In, values=[prod]. Label immutability: Deployment/StatefulSet selector immutable after creation (changing selector creates new resource). Pod labels mutable (can add/remove labels on running pods with kubectl label, useful for removing pods from Service temporarily). Label limits: no hard cap on the number of labels per object, but keys (63 chars plus optional 253-char prefix) and values (63 chars) are length-limited. Large label sets increase etcd storage and API response size (prefer annotations for large metadata like JSON configs). Common pitfalls (2025): (1) Selector mismatch: Deployment selector app=nginx but pod template labels app=web → no pods created, error: selector does not match template labels. Fix: Ensure selector matches pod template labels exactly. (2) Typos in labels: Service selector app=ngnix (typo) but pods labeled app=nginx → no endpoints, Service routes to nothing. Fix: Use label validation in CI/CD, test Service endpoints after deployment. (3) Overwriting labels accidentally: kubectl label pod nginx env=staging without --overwrite when env=prod exists → error: already has a value (prod). Fix: Use --overwrite for intentional updates. (4) Using labels for large data: Storing JSON config in a label → hits the 63-char value limit and bloats etcd. Fix: Use annotations for large metadata (annotations allow up to 256KB total per object vs the 63-char label value limit). Monitoring and troubleshooting: List objects by label: kubectl get pods -l env=prod --show-labels (shows all pod labels). Check Service endpoints: kubectl get endpoints nginx-service (shows selected pod IPs, empty if no pods match selector). Describe resource to see labels: kubectl describe pod nginx-pod shows Labels section. Query metrics by label: PromQL kube_pod_labels{label_app="nginx", label_env="prod"} (filter metrics by pod labels). Labels vs Annotations: Labels: short values (63 chars), used for selection/grouping, queried by selectors. Annotations: large values (up to 256KB total), store metadata (git commit SHA, deployment timestamp, config JSON), not used for selection. Use labels for selection, annotations for metadata.
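A minimal manifest sketch tying the pieces above together; the names, labels, and image tag (nginx:1.27) are illustrative. The Deployment's spec.selector.matchLabels must match spec.template.metadata.labels, and the Service's equality-based selector picks up the same pods:

```yaml
# Deployment: selector and pod template labels must agree
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
      tier: frontend
  template:
    metadata:
      labels:
        app: nginx
        tier: frontend
        env: prod
    spec:
      containers:
      - name: nginx
        image: nginx:1.27        # illustrative image tag
        ports:
        - containerPort: 80
---
# Service: equality-based selector only; routes to every pod carrying these labels
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
    tier: frontend
  ports:
  - port: 80
    targetPort: 80
```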

99% confidence
A

Kubernetes Operator: application-specific controller extending Kubernetes API to create, configure, and manage complex stateful applications using domain-specific knowledge. Pattern: Custom Resource Definition (CRD) + custom controller watching CRD resources and reconciling desired state. Components: (1) CRD - defines custom resource (example: kind: PostgresCluster with spec: replicas, version, storage). (2) Controller - watches CRD, implements reconciliation loop (observe current state, compare to desired state, take actions). (3) Domain logic - encodes operational knowledge (deployment, scaling, backup, recovery, upgrades, failover). Automates Day 2 operations: (1) Database management - PostgreSQL Operator handles replication setup, automated backups, point-in-time recovery, failover, version upgrades. (2) Certificate management - cert-manager Operator issues/renews TLS certificates from Let's Encrypt/Vault. (3) Monitoring - Prometheus Operator manages Prometheus instances, ServiceMonitor CRDs, alert rules. (4) Messaging - Strimzi Kafka Operator handles broker scaling, topic management, security. (5) Backup - Velero Operator automates cluster backup/restore. Examples: Zalando PostgreSQL Operator (production-grade PostgreSQL clusters with streaming replication), MongoDB Enterprise Operator (sharded clusters, backups to S3), Redis Operator (Redis clusters with Sentinel). Operator maturity levels: Level 1 (basic install), Level 2 (seamless upgrades), Level 3 (full lifecycle), Level 4 (deep insights), Level 5 (auto-pilot, self-healing). Development: Operator SDK (Go, Ansible, Helm-based operators), Kubebuilder framework. Operator Hub: operatorhub.io, a catalog of hundreds of community and certified operators. Problems solved: reduces operational complexity, encodes tribal knowledge, enables self-service for developers, ensures consistency across clusters. Production benefit: a mature PostgreSQL Operator automates most routine DBA work (replication setup, scheduled backups, automatic failover), leaving humans to handle exceptions. Essential for running stateful apps (databases, message queues, caches) in Kubernetes with far less manual intervention.
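A hypothetical custom resource an operator might reconcile; the API group (example.com/v1), kind, and field names are invented for illustration and do not match any specific operator's schema:

```yaml
apiVersion: example.com/v1          # illustrative API group, not a real operator's
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3                       # operator provisions a primary plus two streaming replicas
  version: "16"                     # desired major version; operator orchestrates upgrades
  storage:
    size: 100Gi
    storageClassName: fast-ssd
  backup:
    schedule: "0 2 * * *"           # operator schedules nightly base backups
```

Applying a resource like this is a no-op unless a controller that watches the PostgresCluster kind is installed in the cluster; the value of the pattern is entirely in that controller's reconciliation logic.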

99% confidence
A

kubectl get pods lists pods with status, ready count, restarts, and age. Essential flags: -n (specific namespace), --all-namespaces or -A (all namespaces), -o wide (node and IP info), -o yaml/json (full definition), --show-labels (display labels), -l app=nginx (label filter), --field-selector status.phase=Running (field filter), --watch (stream updates). Output columns: NAME, READY (e.g., 2/2 = 2 of 2 containers ready), STATUS (Running/Pending/Failed/CrashLoopBackOff), RESTARTS, AGE. Example: kubectl get pods -n production -l tier=frontend --field-selector spec.nodeName=node-1 shows frontend pods on node-1.
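A few combinations worth keeping at hand (namespace and label values are illustrative):

```bash
# Everything, everywhere, with node placement and pod IPs
kubectl get pods -A -o wide

# Frontend pods in production, streaming changes as they happen
kubectl get pods -n production -l tier=frontend --watch

# Anything not Running, with labels shown for quick triage
kubectl get pods -A --field-selector status.phase!=Running --show-labels
```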

99% confidence
A

kubectl describe pod shows comprehensive pod details: metadata (labels, annotations), node placement, IP addresses, container specs (image, ports, env vars, probes), resource requests/limits, conditions (PodScheduled, Ready), volumes, and events. Critical for troubleshooting: Last State shows restart reasons (Exit Code 137 = SIGKILL, usually OOMKilled; 1 = application error), Events section shows lifecycle issues (FailedScheduling = insufficient resources, ImagePullBackOff = image pull failure such as a wrong image name/tag or registry auth). Example: kubectl describe pod myapp-7d8f9c-xk2lp displays recent events (retained ~1 hour by default) with timestamps. Essential for diagnosing pod failures.
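A typical triage sequence (pod name and namespace are illustrative):

```bash
# Full pod detail; events are listed at the end of the output
kubectl describe pod myapp-7d8f9c-xk2lp -n production

# Recent warnings across the namespace, newest last
kubectl get events -n production --field-selector type=Warning --sort-by=.lastTimestamp
```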

99% confidence
A

kubectl logs streams container stdout/stderr. Key flags: -f or --follow (continuous stream like tail -f), --tail=100 (last 100 lines), --since=1h (last hour), --timestamps (add timestamps), --previous or -p (logs from the crashed previous container instance, critical for CrashLoopBackOff debugging), -c (specific container if pod has multiple), --all-containers (logs from every container in the pod). Example: kubectl logs myapp-7d8f9c-xk2lp -c app --tail=500 -f. For crash debugging: kubectl logs --previous shows why the container exited before the restart. Multi-container pods require the -c flag (or --all-containers).
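Common invocations, reusing the illustrative pod and label names from this answer:

```bash
# Follow the last 500 lines of the app container, timestamped
kubectl logs myapp-7d8f9c-xk2lp -c app --tail=500 --timestamps -f

# Inspect why the previous container instance crashed
kubectl logs myapp-7d8f9c-xk2lp -c app --previous

# Aggregate recent logs across all pods matching a label selector
kubectl logs -l app=myapp --all-containers --since=30m
```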

99% confidence
A

kubectl exec -it <pod-name> -- /bin/bash executes an interactive shell in a container. Flags: -it (interactive terminal), -c (specify container if multiple), -- (separates kubectl flags from the command). Common commands: /bin/bash or /bin/sh (shell), env (environment vars), ps aux (processes), curl localhost:8080/health (test endpoints). Example: kubectl exec -it myapp-7d8f9c-xk2lp -c sidecar -- curl localhost:15000/stats. Non-interactive: kubectl exec myapp-7d8f9c-xk2lp -- ls /data. Requires pods/exec RBAC permission. Essential for debugging container internals.
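A few usage patterns (pod/container names are illustrative; minimal images may lack bash or curl entirely, in which case kubectl debug is the better tool):

```bash
# Interactive shell; fall back to /bin/sh when bash is not in the image
kubectl exec -it myapp-7d8f9c-xk2lp -c app -- /bin/sh

# One-off, non-interactive commands
kubectl exec myapp-7d8f9c-xk2lp -- env
kubectl exec myapp-7d8f9c-xk2lp -c app -- ls -lah /data
```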

99% confidence
A

kubectl delete pod deletes a pod with graceful termination (default 30s grace period: SIGTERM, then SIGKILL). Flags: --grace-period=60 (custom grace period in seconds), --now (alias for --grace-period=1), --force with --grace-period=0 (removes the pod from the API immediately without waiting for kubelet confirmation - use only for pods stuck in Terminating), -l app=nginx (delete all pods matching the label selector, dangerous in production). Deployment/ReplicaSet-managed pods recreate immediately (controller maintains desired replicas). For permanent deletion, delete the owning controller: kubectl delete deployment myapp. StatefulSet pods recreate with the same identity. Example: kubectl delete pod myapp-7d8f9c-xk2lp --grace-period=120 waits 2 minutes for shutdown.
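Typical variations (names illustrative):

```bash
# Graceful delete with an extended 2-minute shutdown window
kubectl delete pod myapp-7d8f9c-xk2lp --grace-period=120

# Last resort for a pod stuck in Terminating: skip graceful shutdown entirely
kubectl delete pod myapp-7d8f9c-xk2lp --force --grace-period=0

# Delete the workload permanently by removing its controller
kubectl delete deployment myapp
```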

99% confidence
A

kubectl port-forward <pod-name> <local-port>:<pod-port> forwards a local port to a pod port, bypassing the Service for direct debugging. Syntax: kubectl port-forward <pod-name> 8080:80 (local 8080 → pod 80), kubectl port-forward <pod-name> :80 (random local port), multiple ports: kubectl port-forward <pod-name> 8080:80 9090:9090. Works with Services: kubectl port-forward svc/myapp 8080:80 (forwards to a single pod backing the Service). Defaults to localhost only (http://localhost:8080); use --address 0.0.0.0 for external access (security risk). Example: kubectl port-forward myapp-7d8f9c-xk2lp 3000:3000 -n production, then curl http://localhost:3000/health tests the pod directly. Essential for debugging without LoadBalancer/Ingress.
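A quick debugging loop (names and ports illustrative; run the curl in a second terminal while the forward is active):

```bash
# Local 8080 -> pod port 80 in the production namespace
kubectl port-forward myapp-7d8f9c-xk2lp 8080:80 -n production

# Or forward through a Service (kubectl selects one backing pod)
kubectl port-forward svc/myapp 8080:80

# In another terminal: exercise the forwarded endpoint
curl http://localhost:8080/health
```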

99% confidence
A

kubectl top pod shows CPU and memory usage for pods; requires metrics-server installed (check: kubectl get deployment metrics-server -n kube-system). Flags: --containers (per-container usage in multi-container pods), --all-namespaces or -A, -l app=nginx (label filter), --sort-by=cpu or --sort-by=memory. Output columns: NAME, CPU(cores) (millicores, 250m = 0.25 core), MEMORY(bytes) (e.g., 128Mi); percentage columns (CPU%/MEMORY% of allocatable) appear in kubectl top node, not kubectl top pod. Example: kubectl top pod --all-namespaces --sort-by=memory shows memory hogs. Shows actual usage, not requests/limits - compare with kubectl describe for capacity vs usage. Essential for identifying resource bottlenecks.
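A short sketch for spotting resource hogs and comparing usage against declared requests (pod name illustrative; the grep is just a convenience filter over the describe output):

```bash
# Confirm metrics-server is present before relying on kubectl top
kubectl get deployment metrics-server -n kube-system

# Heaviest memory consumers cluster-wide, broken down per container
kubectl top pod -A --containers --sort-by=memory

# Live usage vs. declared requests/limits for one pod
kubectl top pod myapp-7d8f9c-xk2lp
kubectl describe pod myapp-7d8f9c-xk2lp | grep -A 4 -E 'Requests|Limits'
```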

99% confidence
A

kubectl debug creates an ephemeral debug container in a running pod (ephemeral containers: beta in Kubernetes 1.23, stable since 1.25). Syntax: kubectl debug -it <pod-name> --image=busybox:1.36 --target=<container-name> shares the target container's process namespace, so it sees all of its processes. Use case: debug distroless containers without a shell (no /bin/bash in production images). Example: kubectl debug myapp-7d8f9c-xk2lp -it --image=nicolaka/netshoot --target=app attaches debugging tools to a production pod without modifying the original image. Ephemeral containers cannot be removed or restarted once added; they exit when their process ends and disappear only when the pod is deleted. Essential for debugging minimal production images (distroless, scratch-based) that lack debugging tools.
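Example invocations (pod, container, and node names are illustrative; busybox and netshoot are simply convenient tool images):

```bash
# Attach a throwaway debug container sharing the target container's process namespace
kubectl debug -it myapp-7d8f9c-xk2lp --image=busybox:1.36 --target=app

# Inside the debug container: target processes are visible, and a process's
# root filesystem can be reached via /proc/<pid>/root
ps aux

# Debug a node by launching a helper pod on it (the node's filesystem is mounted at /host)
kubectl debug node/node-1 -it --image=busybox:1.36
```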

99% confidence