Kubernetes FAQ & Answers
38 expert Kubernetes answers researched from official documentation. Every answer cites authoritative sources you can verify.
A Pod is a group of one or more containers with shared storage and network resources, running in a shared context. Pods are the smallest deployable units in Kubernetes because they model an application-specific 'logical host' where containers are tightly coupled and co-located. Each Pod gets its own unique cluster-wide IP address. Usually you don't create Pods directly; instead, use higher-level controllers such as Deployments or Jobs.
A ReplicaSet maintains a stable set of replica Pods at any given time, ensuring availability. A Deployment is a higher-level controller that manages ReplicaSets, providing declarative updates, rolling updates with maxUnavailable/maxSurge controls, and rollback via 'kubectl rollout undo'. Deployments create ReplicaSets automatically - each update creates a new ReplicaSet while scaling down the old one. Use Deployments for stateless workloads; you rarely need to manage ReplicaSets directly. Example: 'kubectl create deployment nginx --image=nginx:1.27 --replicas=3' creates a Deployment managing a ReplicaSet with 3 Pods. View ReplicaSets with 'kubectl get rs'.
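A minimal Deployment manifest equivalent to the kubectl command above might look like the sketch below (the nginx name and nginx:1.27 image come from the example; the labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                 # the Deployment creates a ReplicaSet that keeps 3 Pods running
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx            # must match spec.selector
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
```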
ClusterIP (default): exposes Service on internal IP, accessible only within cluster. NodePort: exposes Service on static port (30000-32767) on each Node's IP, accessible externally. LoadBalancer: provisions cloud load balancer with external IP. ExternalName: maps Service to DNS (CNAME record). Headless (clusterIP: None): used with StatefulSets for direct Pod discovery. Use ClusterIP for internal microservices, LoadBalancer for production external access (avoid NodePort in production due to security), headless for StatefulSets. Example: 'kubectl expose deployment nginx --type=LoadBalancer --port=80'.
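A declarative sketch of the same exposure, assuming the Pods carry an app=nginx label; swap the type for ClusterIP or NodePort as appropriate:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer     # ClusterIP (default) for internal-only, NodePort for node-level access
  selector:
    app: nginx           # routes to Pods carrying this label
  ports:
  - port: 80             # Service port
    targetPort: 80       # container port
```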
ConfigMaps store non-confidential data as key-value pairs (max 1 MiB). Secrets store sensitive data (passwords, tokens, TLS certs). Secrets are base64-encoded but NOT encrypted by default in etcd. CRITICAL: Enable encryption at rest in production via EncryptionConfiguration. Both mount as env vars or volumes. Example: 'kubectl create secret generic db-pass --from-literal=password=mypass', 'kubectl create configmap app-config --from-file=config.yaml'. Production requirements: enable etcd encryption, apply strict RBAC, use external secret managers (HashiCorp Vault, AWS Secrets Manager) for high-security environments.
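One way a Pod could consume the objects created above (the db-pass and app-config names come from the commands in the answer; the image and mount path are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp:1.0            # illustrative image
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:           # injects the Secret key as an env var
          name: db-pass
          key: password
    volumeMounts:
    - name: config
      mountPath: /etc/app       # ConfigMap files appear under this path
  volumes:
  - name: config
    configMap:
      name: app-config
```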
PersistentVolume (PV) is cluster storage provisioned statically by admin or dynamically via StorageClass. PersistentVolumeClaim (PVC) is a storage request specifying size and access mode (ReadWriteOnce, ReadOnlyMany, ReadWriteMany). Dynamic provisioning automatically creates PVs when PVC references a StorageClass. PVs have reclaim policies: Retain (manual cleanup), Delete (auto-delete), Recycle (deprecated). StatefulSets use volumeClaimTemplates to create unique PVCs per Pod. Example: Create PVC requesting 10Gi storage, bind to PV, mount in Pod. Use CSI drivers for cloud storage (AWS EBS, Azure Disk, GCE PD).
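A PVC sketch requesting 10Gi as in the example; 'fast' is an assumed StorageClass name and the access mode is illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
  - ReadWriteOnce          # single-node read/write
  storageClassName: fast   # triggers dynamic provisioning via this StorageClass
  resources:
    requests:
      storage: 10Gi
```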
Namespaces provide a mechanism for isolating groups of resources within a single cluster. Resource names must be unique within a namespace but not across namespaces. Use namespaces for environments with many users across multiple teams or projects. For clusters with few to tens of users, namespaces aren't necessary. Avoid 'kube-' prefix (reserved for system namespaces). Namespaces cannot be nested.
Ingress exposes HTTP/HTTPS routes to Services, providing load balancing, SSL termination, and name-based virtual hosting. An Ingress Controller (NGINX, Traefik, AWS ALB) must be running to fulfill Ingress rules; controllers are not started automatically with the cluster. CRITICAL 2025 UPDATE: Ingress NGINX is retiring in March 2026; migrate to Gateway API (GA since v1.0, v1.4 latest). Gateway API provides role-based routing (GatewayClass, Gateway, HTTPRoute), L4/L7 protocol support, and vendor-neutral specs. Use the ingress2gateway tool for migration. Example: 'kubectl apply -f ingress.yaml' creates an Ingress routing traffic based on host/path rules.
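A minimal host/path Ingress of the kind 'kubectl apply -f ingress.yaml' would create; the hostname, backend Service name, and ingressClassName are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx          # must match an installed Ingress controller
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web              # existing Service
            port:
              number: 80
```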
Deployments manage stateless apps where Pods are interchangeable. StatefulSets manage stateful apps requiring stable identities and persistent storage. StatefulSets provide: (1) stable DNS names (pod-0.service, pod-1.service), (2) ordered deployment/scaling (pod-0 before pod-1), (3) unique persistent storage via volumeClaimTemplates. Require headless Service (clusterIP: None) for Pod network identity. Use for databases (PostgreSQL, MySQL, MongoDB), distributed systems (Kafka, Cassandra, Elasticsearch). Example: 'kubectl scale statefulset mysql --replicas=5' scales sequentially. Production: set PodManagementPolicy to Parallel for faster scaling.
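A trimmed sketch of the headless-Service-plus-StatefulSet pairing described above, reusing the mysql name from the scale example; the image, port, and storage size are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  clusterIP: None            # headless Service for stable per-Pod DNS (mysql-0.mysql, ...)
  selector:
    app: mysql
  ports:
  - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:      # one PVC per Pod (data-mysql-0, data-mysql-1, ...)
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```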
A DaemonSet ensures that all (or selected) Nodes run exactly one copy of a Pod, and it automatically adds Pods to new nodes. Common use cases: (1) log collection (Fluentd, Filebeat), (2) node monitoring (Prometheus Node Exporter, Datadog agent), (3) networking (CNI plugins like Calico, Cilium), (4) storage daemons (Ceph, GlusterFS). Use nodeSelector or affinity to run on specific nodes. DaemonSet Pods tolerate the node.kubernetes.io/unschedulable taint, so they still schedule onto cordoned nodes. Example: 'kubectl get daemonsets -n kube-system' shows system DaemonSets. Production: set resource limits to prevent node resource exhaustion, use updateStrategy: RollingUpdate for safe updates.
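A log-collection DaemonSet sketch along the lines described above; the fluentd image tag and resource numbers are illustrative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate          # safe node-by-node updates
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
      - name: fluentd
        image: fluentd:v1.16     # illustrative log-collection agent
        resources:
          limits:
            memory: 256Mi        # cap per-node resource usage
            cpu: 200m
```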
Jobs create Pods that run to completion, tracking success and retrying failures. Unlike Deployments, Pods terminate after success. Use for one-time batch tasks: data migration, backups, ETL processing. Key settings: completions (successful runs needed), parallelism (concurrent Pods), backoffLimit (retry limit), activeDeadlineSeconds (timeout). 2025 best practice: set ttlSecondsAfterFinished: 900 for automatic cleanup after 15 minutes. Example: 'kubectl create job backup --image=mysql:8.0 -- sh -c "mysqldump db > /backup/backup.sql"' (the redirection must run inside the container's shell, not your local one). Production: set resource limits, use restartPolicy: OnFailure (retry in same Pod) or Never (create new Pod).
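A Job manifest sketch combining the settings above; the backup name and the mysqldump command are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backup
spec:
  completions: 1
  backoffLimit: 3                  # retry up to 3 times on failure
  activeDeadlineSeconds: 600       # give up after 10 minutes
  ttlSecondsAfterFinished: 900     # auto-delete 15 minutes after completion
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: backup
        image: mysql:8.0
        command: ["sh", "-c", "mysqldump db > /backup/backup.sql"]   # illustrative; /backup assumed mounted
```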
CronJobs create Jobs on schedules using cron format ('0 2 * * *' = daily at 2 AM). Automatically spawn Jobs at scheduled times. Use for recurring tasks: daily backups, weekly reports, hourly cleanup. Key settings: schedule (cron syntax), concurrencyPolicy (Allow/Forbid/Replace), successfulJobsHistoryLimit: 1, failedJobsHistoryLimit: 1. 2025 best practice: combine ttlSecondsAfterFinished: 900 with history limits for automatic cleanup. Example YAML: schedule: '*/5 * * * *' runs every 5 minutes. Production: use Forbid to prevent overlapping jobs, set startingDeadlineSeconds: 60 for missed runs, always set resource limits.
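A CronJob sketch with the recommended settings; the schedule and backup command are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"            # daily at 2 AM
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 900
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: mysql:8.0
            command: ["sh", "-c", "mysqldump db > /backup/backup.sql"]   # illustrative
```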
Use Job for one-time tasks: data migration, manual backups, incident response scripts, post-deployment cleanup. Use CronJob for scheduled recurring tasks: daily backups (0 2 * * *), weekly reports (0 9 * * 1), hourly cache clear (0 * * * *). Job patterns: completions: 1 for single run, parallelism: 5 for parallel processing, work queues with indexed jobs. CronJob patterns: concurrencyPolicy: Forbid for resource-intensive tasks (prevents overlap), Allow for lightweight tasks. Both support ttlSecondsAfterFinished for automatic cleanup. Trigger CronJob manually: 'kubectl create job manual-backup --from=cronjob/backup'.
Requests: minimum CPU/memory guaranteed to container, used for Pod scheduling (node selection). Limits: maximum resources container can use, enforced by kubelet/cgroup. CPU is compressible (throttled at limit), memory is incompressible (Pod OOMKilled if exceeded). QoS classes: Guaranteed (requests=limits), Burstable (requests<limits), BestEffort (no requests/limits). Production best practice: always set requests=limits for critical workloads (Guaranteed QoS). Example: requests: {cpu: '100m', memory: '128Mi'}, limits: {cpu: '500m', memory: '512Mi'}. Use VPA (VerticalPodAutoscaler) for automatic right-sizing recommendations.
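The requests/limits example above expressed as a Pod spec (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
  - name: api
    image: myapp:1.0            # illustrative
    resources:
      requests:
        cpu: 100m               # guaranteed minimum, used by the scheduler
        memory: 128Mi
      limits:
        cpu: 500m               # CPU is throttled above this
        memory: 512Mi           # container is OOMKilled above this
```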
Liveness probes detect if a container is hung (deadlock, infinite loop) and restart it. Kubelet kills and restarts container after failureThreshold consecutive failures. Types: httpGet (HTTP endpoint), exec (command), tcpSocket (port check), grpc (gRPC health check, K8s 1.27+). Key settings: initialDelaySeconds: 15 (startup delay), periodSeconds: 10 (check interval), timeoutSeconds: 2, failureThreshold: 3, successThreshold: 1 (must be 1). 2025 best practice: use dedicated /health endpoint (not /), lightweight checks (<100ms), avoid checking dependencies (use readiness instead). Example: httpGet to /health on port 8080.
Readiness probes control when Pods receive Service traffic. Failed probes remove the Pod from Service endpoints (no traffic) without restarting the container; once the probe passes again, the Pod is added back to the endpoints. Critical for: initialization periods, dependency checks (database connection), warm-up phases. Readiness probes should check the dependencies that liveness probes skip. Types: httpGet (most common, GET /ready), exec, tcpSocket, grpc. Key settings: periodSeconds: 5 (more frequent than liveness), failureThreshold: 3, successThreshold: 2 (prevent flapping). 2025 best practice: separate /ready (readiness) from /health (liveness) endpoints. Example: verify DB connection, cache ready, external APIs reachable.
Startup probes protect slow-starting apps from premature liveness probe failures: liveness and readiness checks are disabled until the startup probe succeeds, after which both activate. Use for: legacy apps, large Java/Spring Boot applications (>60s startup), database warm-up, complex initialization. Calculate: failureThreshold × periodSeconds = max startup time. Example: failureThreshold: 30, periodSeconds: 10 = 5-minute startup window. 2025 best practice: prefer startup probes over high initialDelaySeconds on liveness probes (faster recovery after startup). successThreshold must be 1 for startup probes. YAML: same endpoint as liveness but higher thresholds.
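Pulling the three probe answers together, a container spec fragment might look like this; the /health and /ready endpoints and port 8080 follow the examples above, while the numeric settings are illustrative starting points:

```yaml
containers:
- name: app
  image: myapp:1.0                 # illustrative
  startupProbe:
    httpGet: {path: /health, port: 8080}
    failureThreshold: 30           # 30 x 10s = up to 5 minutes to start
    periodSeconds: 10
  livenessProbe:
    httpGet: {path: /health, port: 8080}
    periodSeconds: 10
    timeoutSeconds: 2
    failureThreshold: 3            # restart after 3 consecutive failures
  readinessProbe:
    httpGet: {path: /ready, port: 8080}
    periodSeconds: 5
    failureThreshold: 3
    successThreshold: 2            # require 2 passes before receiving traffic again
```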
Helm is the Kubernetes package manager, like apt/yum/Homebrew for K8s. Charts are packages containing templated YAML manifests, default values, and metadata. Helm provides: Go templating (values.yaml → manifests), versioned releases (track history), atomic installs (rollback on failure), and dependency management (Chart.yaml dependencies). Helm 3 (current) removed the Tiller server for security. Commands: 'helm install myapp bitnami/nginx' (install), 'helm upgrade myapp bitnami/nginx' (update), 'helm rollback myapp 1' (revert). Public charts: Artifact Hub (artifacthub.io). Production: use the 'helm diff' plugin before upgrades, lock chart versions.
Rolling update gradually replaces Pods with new version. Update: 'kubectl set image deployment/nginx nginx=nginx:1.27' or 'kubectl apply -f deployment.yaml'. Strategy controls: maxUnavailable (Pods down during update, default 25%), maxSurge (extra Pods during update, default 25%). Monitor: 'kubectl rollout status deployment/nginx', pause: 'kubectl rollout pause', resume: 'kubectl rollout resume'. Rollback: 'kubectl rollout undo deployment/nginx' (previous), 'kubectl rollout undo deployment/nginx --to-revision=3' (specific). History: 'kubectl rollout history deployment/nginx'. Production: set revisionHistoryLimit: 10 to retain history, use readiness probes to prevent bad rollouts.
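A Deployment strategy sketch making the defaults explicit; the replica count and image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 4
  revisionHistoryLimit: 10        # keep 10 old ReplicaSets for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%         # at most 1 of 4 Pods down during the update
      maxSurge: 25%               # at most 1 extra Pod created during the update
  selector:
    matchLabels: {app: nginx}
  template:
    metadata:
      labels: {app: nginx}
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
```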
NetworkPolicies are L3/L4 firewall rules controlling Pod traffic. By default, all Pods can communicate (no restrictions). NetworkPolicies select Pods via labels, then define ingress (incoming) and egress (outgoing) rules. Rules allow traffic from: podSelector (specific Pods), namespaceSelector (Pods in namespace), ipBlock (CIDR ranges). Requires CNI plugin support (Calico, Cilium, Weave Net, Azure CNI). Policies are additive (whitelist model). 2025 best practice: start with default deny-all policy, explicitly allow needed traffic. Example: deny all ingress, allow from app=frontend on port 8080. Production: combine with Pod Security Standards.
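A sketch of the default-deny-plus-allow pattern described above; the app=backend label for the protected Pods is an assumption:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}            # selects all Pods in the namespace
  policyTypes: [Ingress]     # deny all ingress by default
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend           # assumed label on the protected Pods
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend      # only frontend Pods may connect
    ports:
    - protocol: TCP
      port: 8080
```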
ServiceAccounts (SA) provide Pod identities for API authentication. Each namespace has 'default' SA. Pods auto-mount SA token for API calls. RBAC controls SA permissions: Role (namespace-scoped verbs/resources), ClusterRole (cluster-wide), RoleBinding (bind Role to SA), ClusterRoleBinding (bind ClusterRole). Example: create SA 'myapp-sa', bind Role with 'get pods' permission, set pod.spec.serviceAccountName: myapp-sa. 2025 security: never use default SA in production, apply least-privilege principle, use Pod Security Standards Restricted profile, audit RBAC regularly with 'kubectl auth can-i --list --as=system:serviceaccount:ns:sa'.
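A sketch of the SA/Role/RoleBinding chain from the example; the default namespace is assumed:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]       # least-privilege: read Pods only
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myapp-sa-pod-reader
  namespace: default
subjects:
- kind: ServiceAccount
  name: myapp-sa
  namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```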
HPA automatically scales Pod replicas based on metrics (CPU, memory, custom). Checks every 15s (default), adjusts Deployment/StatefulSet/ReplicaSet replicas. Requires metrics-server for resource metrics. Example: 'kubectl autoscale deployment app --cpu-percent=70 --min=2 --max=10'. Formula: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric)). 2025 advanced: use KEDA for event-driven autoscaling (Kafka lag, queue depth, 60+ scalers), combine with VPA for vertical scaling, set behavior.scaleDown.stabilizationWindowSeconds: 300 to prevent flapping. HPA doesn't work with DaemonSets. Production: always set resource requests (HPA requires them).
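The same autoscaling setup expressed declaratively (autoscaling/v2), including the scale-down stabilization window mentioned above; the target Deployment name is assumed:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70          # scale when average CPU > 70% of requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down
```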
Labels are key-value pairs for organizing objects (max 63 chars per value). Selectors query objects by labels. Equality-based: 'app=nginx', 'env!=dev'. Set-based: 'env in (prod,staging)', 'tier notin (cache,db)'. Services use selectors for Pod routing, Deployments for replica management, NetworkPolicies for traffic rules. Example: label Pod with 'app=web,tier=frontend', Service selector 'app=web' routes to it. Production best practices: use standard labels (app.kubernetes.io/name, app.kubernetes.io/version, app.kubernetes.io/component), apply via 'kubectl label', query via 'kubectl get pods -l app=web,env=prod'.
Init containers run sequentially before app containers and must succeed before the app starts. Use cases: (1) wait for dependencies ('until nslookup db; do sleep 2; done'), (2) setup tasks (git clone, config generation, chmod permissions), (3) security (download certs, validate licenses). Init containers can use different images, security contexts, and access Secrets the app shouldn't see. Sidecar containers (beta since K8s 1.29, stable in 1.33): set restartPolicy: Always on an init container to keep it running alongside the app. Example: init container downloads config from S3, main container reads it. Init containers re-run whenever the Pod restarts.
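A Pod sketch combining the wait-for-dependency init container from the example with a sidecar-style init container; the log-shipper image is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ["sh", "-c", "until nslookup db; do sleep 2; done"]   # blocks until the db Service resolves
  - name: log-shipper
    image: fluentd:v1.16            # illustrative sidecar
    restartPolicy: Always           # sidecar: keeps running alongside the app container
  containers:
  - name: web
    image: nginx:1.27
```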
Node affinity constrains Pod scheduling to specific nodes. NodeSelector: simple key=value matching, hard requirement (Pod won't schedule if no match). Node affinity: expressive rules with operators (In, NotIn, Exists, DoesNotExist, Gt, Lt), supports soft preferences. Types: requiredDuringSchedulingIgnoredDuringExecution (hard, must match), preferredDuringSchedulingIgnoredDuringExecution (soft, prefer but not required with weight 1-100). Example: require node in us-west, prefer SSD storage (weight 80). Use nodeSelector for simple 'disktype=ssd', node affinity for complex multi-condition rules. Production: combine with pod affinity/anti-affinity for workload distribution.
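A sketch of the 'require us-west, prefer SSD' example; the zone values and disktype label are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:     # hard: must run in these zones
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-west-2a", "us-west-2b"]
      preferredDuringSchedulingIgnoredDuringExecution:    # soft: prefer SSD nodes
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: web
    image: nginx:1.27
```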
Taints repel Pods from nodes. Tolerations allow Pods to tolerate (schedule on) tainted nodes. Taint effects: NoSchedule (no new Pods), PreferNoSchedule (avoid if possible), NoExecute (evict existing Pods without toleration). Apply: 'kubectl taint nodes node1 gpu=true:NoSchedule'. Remove: 'kubectl taint nodes node1 gpu-'. Pod toleration matches taint: key, value (optional), operator (Equal/Exists), effect. Use cases: dedicated nodes (GPU, high-memory), maintenance (drain nodes), node problems (disk pressure). Built-in taints: node.kubernetes.io/not-ready, node.kubernetes.io/unreachable. Production: combine with node affinity for complex placement.
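A toleration sketch matching the 'gpu=true:NoSchedule' taint applied above; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  tolerations:
  - key: gpu                 # matches 'kubectl taint nodes node1 gpu=true:NoSchedule'
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: trainer
    image: myml:1.0          # illustrative GPU workload
```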
StorageClass defines storage tiers (SSD, HDD, replicated) for dynamic PV provisioning. When PVC requests StorageClass, provisioner automatically creates PV. Components: provisioner (kubernetes.io/aws-ebs, disk.csi.azure.com, ebs.csi.aws.com), parameters (type, IOPS, zones), reclaimPolicy (Delete/Retain), volumeBindingMode (Immediate/WaitForFirstConsumer). Example: StorageClass 'fast' with SSD provisioner, PVC requests 'fast', PV auto-created. Default StorageClass auto-applies to PVCs without storageClassName. 2025 best practice: use CSI drivers (Container Storage Interface), set WaitForFirstConsumer for zone-aware provisioning, Retain policy for production data.
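A StorageClass sketch along the lines of the 'fast' example; the provisioner and gp3 volume type assume the AWS EBS CSI driver and should be swapped for your platform:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: ebs.csi.aws.com              # CSI driver; platform-specific
parameters:
  type: gp3                               # illustrative SSD volume type
reclaimPolicy: Retain                     # keep the volume after PVC deletion
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the Pod lands
```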
Kubernetes uses the Container Runtime Interface (CRI) for pluggable container runtimes. containerd is the industry-standard runtime (CNCF graduated; originally Docker's internal engine). Kubernetes 1.24+ removed dockershim, so Docker Engine is no longer supported directly as a runtime (images built with Docker still run). 2025 status: most clusters use containerd (default in GKE, EKS, AKS) for better performance, smaller footprint, faster Pod startup. Other runtimes: CRI-O (lightweight, OCI-compliant), Mirantis Container Runtime (Docker Engine derivative). containerd manages image pulling, storage, execution, lifecycle. Check runtime: 'kubectl get nodes -o wide' (CONTAINER-RUNTIME column). Production: containerd with cgroup v2 for improved resource isolation.
Control plane manages cluster state. Components: (1) kube-apiserver: REST API front-end, validates/processes requests, horizontally scalable (run 3+ for HA). (2) etcd: distributed key-value store for cluster state, consistency via Raft protocol (requires 3/5/7 nodes for quorum). (3) kube-scheduler: assigns Pods to nodes via scoring (resources, affinity, taints). (4) kube-controller-manager: runs control loops (Deployment, ReplicaSet, Node, ServiceAccount controllers). (5) cloud-controller-manager: cloud-specific (LoadBalancer, node lifecycle). Node components: kubelet (Pod lifecycle agent), kube-proxy (network rules), container runtime (containerd). Production: run control plane on separate nodes.
Get resources: 'kubectl get pods/deployments/services'. Describe: 'kubectl describe pod <pod-name>'.
Gateway API v1.0 reached GA on October 31, 2023 after four years of development, making it one of the most collaborative APIs in Kubernetes history. The latest version, v1.4.0, was released October 6, 2025 with new GA features including default gateways and updated client certificate validation. It is the successor to Ingress, with 20+ implementations. Gateway, GatewayClass, and HTTPRoute graduated to GA in v1.0. Ingress development is frozen - all new features go to Gateway API. Supports both L4/L7 protocols (TCP, UDP, HTTP, gRPC), role-based separation, and portability across providers without annotations.
Gateway API role-based architecture (2025): Separates networking concerns into three distinct roles (personas) with different permissions and responsibilities, preventing configuration conflicts and enabling team-based workflows in multi-tenant clusters. Three roles: (1) Infrastructure Provider (Ian): Platform engineer managing underlying network infrastructure - provisions GatewayClass resources defining available gateway implementations (AWS ALB, GCP Load Balancer, NGINX, Envoy, Istio), configures cloud provider integrations (VPC, security groups, DNS), sets cluster-wide policies (TLS versions, rate limits, WAF rules). Interacts with: GatewayClass (defines capabilities like supported protocols, features, controller name). Example: Creates GatewayClass named 'aws-alb' pointing to AWS Load Balancer Controller, specifying L7 HTTP/HTTPS support. (2) Cluster Operator (Chihiro): Platform admin managing cluster-level networking resources - creates Gateway instances referencing GatewayClass (selects infrastructure provider's implementation), configures listeners (ports, protocols, TLS certificates), manages external IP/DNS assignments, sets up shared gateways for multiple teams. Interacts with: Gateway (traffic entry points with listeners, TLS config, namespace scope). Example: Creates Gateway 'prod-gateway' with HTTPS listener on port 443, references 'aws-alb' GatewayClass, provisions AWS ALB with public IP. (3) Application Developer (Ana): Developer deploying applications - creates Route resources (HTTPRoute, GRPCRoute, TCPRoute, TLSRoute) defining traffic routing rules (path-based, header-based, weight-based), attaches Routes to existing Gateways (references Gateway by name, no cluster-level permissions needed), configures app-specific routing (path rewrites, redirects, timeouts, retries). Interacts with: HTTPRoute/GRPCRoute/TCPRoute (routing rules attached to Gateway listeners). Example: Creates HTTPRoute routing /api/users to users-service, /api/orders to orders-service, attaches to 'prod-gateway' without touching Gateway config. Why separation matters: (1) Prevents conflicts: Multiple developers can create Routes without coordinating Gateway changes (Ana's /api/users route doesn't affect Chihiro's TLS settings or Ian's load balancer config). (2) Security isolation: App developers can't modify cluster networking (no access to Gateway/GatewayClass), limiting blast radius of misconfigurations. (3) Scalability: Single Gateway serves 100+ teams, each managing own Routes without cross-team coordination. (4) RBAC alignment: Each role maps to Kubernetes RBAC - Ian: cluster-admin (GatewayClass CRUD), Chihiro: namespace-admin (Gateway CRUD), Ana: developer (Route CRUD only). Production workflow (2025): (1) Setup phase (Ian): Deploy Gateway API CRDs (kubectl apply -f gateway-api-crds.yaml), install controller (AWS Load Balancer Controller, NGINX Gateway Fabric, Istio Gateway), create GatewayClasses for each environment (prod-external, prod-internal, dev). (2) Gateway provisioning (Chihiro): Create Gateways per namespace/team (team-a-gateway, team-b-gateway), configure shared gateways for multi-tenant routing, manage TLS certificates (cert-manager integration), set up DNS records for external IPs. (3) Application routing (Ana): Create HTTPRoutes for services (match: path: /api/*, backendRefs: users-service:80), update Routes for traffic splitting (canary deployments: 90% stable, 10% canary), configure retries/timeouts per route. 
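A minimal sketch of the three personas' resources, reusing the aws-alb, prod-gateway, and users-service names from the answer; the controllerName and the TLS Secret are assumptions:

```yaml
# Ian (infrastructure provider): GatewayClass
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: aws-alb
spec:
  controllerName: example.com/aws-alb-controller   # assumed controller name
---
# Chihiro (cluster operator): Gateway with an HTTPS listener
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-gateway
spec:
  gatewayClassName: aws-alb
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      certificateRefs:
      - name: prod-tls          # assumed TLS Secret
---
# Ana (application developer): HTTPRoute attached to the shared Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: users
spec:
  parentRefs:
  - name: prod-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api/users
    backendRefs:
    - name: users-service
      port: 80
```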
Comparison with Ingress (no role separation): Ingress: Single resource combining gateway config + routing rules, developers need cluster-level permissions to change TLS/load balancer, configuration conflicts common (one team's annotation breaks another's routing). Gateway API: Clear separation - Ian defines infrastructure, Chihiro manages gateways, Ana focuses on routing. Role-specific permissions (RBAC example): Ian: apiGroups: [gateway.networking.k8s.io], resources: [gatewayclasses], verbs: [create, update, delete]. Chihiro: apiGroups: [gateway.networking.k8s.io], resources: [gateways], verbs: [create, update, delete], namespaces: [team-a, team-b]. Ana: apiGroups: [gateway.networking.k8s.io], resources: [httproutes, grpcroutes], verbs: [create, update, delete], namespaces: [team-a]. Cross-role collaboration: (1) Gateway attachment: Ana references Chihiro's Gateway by name in HTTPRoute (parentRefs: name: prod-gateway), no permissions needed to modify Gateway. (2) GatewayClass selection: Chihiro chooses Ian's GatewayClass in Gateway spec (gatewayClassName: aws-alb), inherits infrastructure capabilities. (3) Policy enforcement: Ian sets GatewayClass-level policies (RateLimitFilter, AuthenticationFilter), automatically applied to all Gateways/Routes using that class. Advanced features (Gateway API v1.1+): (1) ReferenceGrant: Allow cross-namespace references (Ana's Route in namespace-a can reference Chihiro's Gateway in namespace-b if ReferenceGrant permits). (2) BackendTLSPolicy: Chihiro configures backend TLS (Service mesh integration), Ana's Routes inherit secure backend communication. (3) Gateway inheritance: Child Gateways inherit settings from parent GatewayClass (TLS versions, cipher suites, security policies). Multi-tenancy example: Single cluster, 50 teams: Ian creates 1 GatewayClass (cloud-lb), Chihiro creates 3 Gateways (prod-public, prod-internal, dev), 50 teams create 200+ HTTPRoutes attached to prod-public Gateway (path-based isolation: /team-a/, /team-b/). Migration from Ingress: Use ingress2gateway tool to convert Ingress to Gateway API - Ingress becomes Gateway (listeners from Ingress TLS config) + HTTPRoute (rules from Ingress paths), annotations become Gateway API native features. Best practices (2025): (1) Ian: Create minimal GatewayClasses (1-2 per provider), document capabilities in annotations. (2) Chihiro: Use shared Gateways for cost efficiency (1 Gateway serves N Routes), namespace isolation for security. (3) Ana: Attach Routes to existing Gateways (avoid creating new Gateways), use specific path matches (avoid /* wildcards conflicting with other teams).
Gateway API advantages over Ingress (2025 comprehensive comparison): Gateway API (GA October 2023, v1.2 latest) addresses critical Ingress limitations with native features, role-based architecture, and protocol extensibility. Protocol support (major advantage): (1) Gateway API: L4 + L7 protocols - TCP (TCPRoute), UDP (UDPRoute), TLS (TLSRoute for TLS passthrough/termination), HTTP/HTTPS (HTTPRoute), gRPC (GRPCRoute, GA in v1.1 May 2024), WebSocket (via HTTPRoute upgrade headers). Supports non-HTTP use cases: database proxies (MySQL, PostgreSQL over TCP), game servers (UDP), message queues (AMQP, MQTT). (2) Ingress: L7 HTTP/HTTPS only - cannot route TCP/UDP traffic, no native gRPC support (requires NGINX annotations), limited to web applications. Portability and standardization: (1) Gateway API: 20+ implementations (AWS Load Balancer Controller, GCP GKE Gateway, Azure Application Gateway, NGINX Gateway Fabric, Istio, Envoy Gateway, Traefik, Kong, HAProxy, Contour) with standardized config - same Gateway YAML works across providers (vendor-neutral). No vendor-specific annotations needed for basic features (TLS, redirects, rewrites, headers). (2) Ingress: Heavy reliance on implementation-specific annotations - NGINX uses nginx.ingress.kubernetes.io/rewrite-target, Traefik uses traefik.ingress.kubernetes.io/router.middlewares, AWS ALB uses alb.ingress.kubernetes.io/scheme (no portability). Migrating between ingress controllers requires rewriting annotations (major refactoring). Built-in advanced features (no annotations): (1) Header-based matching: HTTPRoute natively supports match by headers (X-User-Type: premium), cookies (session=abc), query params (?version=2) - Ingress requires controller-specific annotations. (2) Traffic splitting/weight-based routing: HTTPRoute backendRefs with weights (90% stable, 10% canary) for canary deployments, A/B testing - Ingress needs service mesh or controller-specific setup. (3) Request mirroring: HTTPRoute requestMirror sends copy of traffic to test backend without impacting production (shadow traffic) - Ingress lacks native support. (4) Redirects: HTTPRoute requestRedirect (HTTP → HTTPS, path rewrites, hostname changes) built-in - Ingress uses annotations. (5) Timeouts/retries: HTTPRoute timeouts (request: 30s, backendRequest: 15s) and retries (attempts: 3) native - Ingress needs service mesh. Role-based separation (prevents conflicts): (1) Gateway API: Three roles - Infrastructure Provider (GatewayClass), Cluster Operator (Gateway), Application Developer (HTTPRoute) with RBAC isolation. Developers create Routes without cluster-level permissions, multiple teams share Gateway without conflicts (path-based isolation). (2) Ingress: Single resource combining infrastructure + routing - developers need permissions to modify TLS/load balancer settings, one team's annotation can break another's routing (no isolation). Extensibility and future-proofing: (1) Gateway API: Extensible via policies (BackendTLSPolicy for backend encryption, RateLimitFilter for rate limiting, AuthenticationFilter for OAuth) and new Route types (GRPCRoute added v1.1, TCPRoute experimental). Community-driven development (SIG Network), quarterly releases with new features. (2) Ingress: Development frozen (announced 2023) - no new features, community focus shifted to Gateway API. Stuck with 2018-era capabilities (basic HTTP routing, TLS termination). 
Advanced routing capabilities: (1) Gateway API: Path prefix/exact/regex matching (match: path: type: PathPrefix, value: /api/), method-based routing (GET vs POST to different backends), SNI-based routing (TLS hosts), multiple matchers per rule (AND/OR logic). (2) Ingress: Limited to path prefix matching (spec.rules.http.paths), no native method/header routing. Cross-namespace routing: (1) Gateway API: ReferenceGrant allows Routes in namespace-a to reference Services in namespace-b (secure cross-namespace traffic), Gateway in shared namespace serves Routes from 100+ app namespaces (multi-tenancy). (2) Ingress: No native cross-namespace support - Ingress and Service must be in same namespace (limits multi-tenancy). Migration tooling: (1) ingress2gateway: Official tool converts Ingress to Gateway API (ingress2gateway print --input-file ingress.yaml), preserves semantics (TLS, paths, backends), warns about annotation loss (manual migration for custom features). Supports batch conversion for 100+ Ingress resources. (2) Compatibility mode: Some controllers (NGINX Gateway Fabric, Istio) support both Ingress and Gateway API during transition (gradual migration). Performance and efficiency: (1) Gateway API: Shared Gateways reduce load balancer costs (1 Gateway = 1 cloud LB serving N Routes vs N Ingress = N LBs), listener-based routing (single LB with multiple listeners), native HTTP/2 and gRPC (efficient binary protocols). (2) Ingress: Each Ingress may provision separate load balancer (depends on controller), HTTP/1.1 focused (gRPC via workarounds). Production adoption (2025): AWS EKS (AWS Load Balancer Controller v2.8+), GCP GKE (GKE Gateway GA), Azure AKS (Application Gateway for Containers), service meshes (Istio, Linkerd, Cilium) migrating to Gateway API for ingress. Real-world comparison: Scenario: Multi-tenant cluster, 50 teams, HTTP + gRPC services, canary deployments. Ingress approach: 50 Ingress resources (1 per team), 50 cloud load balancers ($50/month each = $2500/month), NGINX annotations for rewrites, service mesh for traffic splitting, no native gRPC support (requires workarounds). Gateway API approach: 1 GatewayClass, 1 Gateway (1 load balancer = $50/month), 50 HTTPRoutes + 20 GRPCRoutes (attached to shared Gateway), native traffic splitting (weights), native gRPC (no mesh needed). Savings: $2450/month + simpler config. When to use Gateway API: (1) Multi-tenant clusters (shared gateways). (2) Non-HTTP protocols (TCP, UDP, gRPC). (3) Advanced routing (header matching, traffic splitting, mirroring). (4) Cross-cloud portability (avoid vendor lock-in). (5) Modern applications (microservices, service mesh). When to keep Ingress: (1) Simple HTTP routing (basic web apps). (2) Legacy applications (migration cost > benefits). (3) Ingress-only controllers (migration not urgent). Migration timeline recommendation: Start migrating to Gateway API in 2025 (Ingress NGINX retiring March 2026), use ingress2gateway tool, run dual stack (Ingress + Gateway API) during transition, deprecate Ingress by 2026.
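A weight-based canary HTTPRoute sketch as described above; the route, Gateway, and Service names are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
spec:
  parentRefs:
  - name: prod-gateway          # shared Gateway, as in the earlier example
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /checkout
    backendRefs:
    - name: checkout-stable     # assumed Services
      port: 80
      weight: 90                # 90% of traffic to stable
    - name: checkout-canary
      port: 80
      weight: 10                # 10% canary
```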
eBPF (Extended Berkeley Packet Filter) fundamentals: Revolutionary Linux kernel technology enabling custom programs to run in kernel space without modifying kernel source code or loading kernel modules. eBPF programs execute in sandboxed virtual machine with JIT compilation to native code, verified by kernel for safety (no crashes, no infinite loops), attach to kernel hooks (network events, system calls, tracepoints) with sub-microsecond latency. Key advantages: (1) Kernel-level performance: Runs at kernel level without context switches to userspace (10-100x faster than userspace networking). (2) Safety: Verifier ensures programs cannot crash kernel, bounded loops, memory access validation. (3) Dynamic loading: Load/unload programs at runtime without kernel reboot or module compilation. (4) Observability: See every packet, syscall, file access without instrumentation overhead. Cilium architecture with eBPF: Cilium 1.15+ (2025 stable) is eBPF-native Kubernetes CNI (Container Network Interface) replacing traditional iptables-based networking with kernel-level packet processing for L3/L4 connectivity, security, and load balancing. How Cilium uses eBPF for Kubernetes networking: (1) CNI datapath (packet forwarding): eBPF programs attached to network interfaces (veth pairs, physical NICs) intercept all packets, implement pod-to-pod routing in kernel (L3 forwarding tables), VXLAN/Geneve overlay tunneling for cross-node communication, direct routing for flat networks. Bypasses iptables entirely (no iptables rules for pod routing). Performance: 5-10 Gbps per core vs 1-2 Gbps with iptables/kube-proxy. (2) Service load balancing (kube-proxy replacement): eBPF programs implement ClusterIP/NodePort/LoadBalancer services entirely in kernel - socket-level load balancing intercepts connect() syscalls, redirects to backend pod before packet leaves kernel (zero-hop LB), supports Direct Server Return (DSR) for asymmetric routing (backend responds directly to client, bypassing load balancer). Connection tracking in eBPF maps (hash tables in kernel memory). Scales to 100K+ services/endpoints without performance degradation (vs kube-proxy iptables mode limited to ~5K services before 10+ second pod startup). (3) NetworkPolicy enforcement: eBPF programs at pod veth interfaces enforce allow/deny rules (L3/L4 filtering by IP, port, protocol, pod labels), identity-based policies using BPF maps (label → identity mapping), L7 policies via Envoy proxy integration (HTTP path/header filtering, gRPC method matching). Policy lookup in BPF hash maps (O(1) vs iptables O(n) traversal). Supports 10K+ policies per cluster. (4) Transparent encryption: eBPF programs encrypt pod-to-pod traffic using IPsec (kernel-native) or WireGuard (eBPF-accelerated) without service mesh overhead. Encryption keys managed by Cilium agent, per-node or per-pod encryption with automatic key rotation. Performance: 8-9 Gbps encrypted throughput (vs 3-4 Gbps with traditional IPsec). (5) Hubble observability: eBPF programs capture all network events (L3/L4/L7 flows, DNS queries, TCP connections, HTTP requests) and push to Hubble relay via perf buffers (ring buffers in kernel). Zero instrumentation - no sidecars, no application changes, sees all traffic including sidecar-to-sidecar in service meshes. Real-time visibility into 100K+ events/second with 1-5% CPU overhead. 
eBPF program lifecycle in Cilium: (1) Compilation: Cilium agent compiles eBPF programs from C templates (customized per node configuration), LLVM generates BPF bytecode, kernel verifier validates safety (rejects unsafe programs). (2) Loading: Attach programs to kernel hooks - XDP (eXpress Data Path) for early packet processing at NIC driver, TC (Traffic Control) for egress/ingress filtering, socket filters for connection tracking. (3) Runtime: Programs execute on every packet/event, access BPF maps for state (connection tracking, policy rules, endpoint info), make forwarding decisions (drop, forward, redirect, modify). Performance benchmarks (Cilium 1.15 vs kube-proxy iptables, 2025): (1) Latency: 50-100 microseconds (eBPF) vs 500-1000 microseconds (iptables) for service routing. (2) Throughput: 10 Gbps single stream (eBPF) vs 2-3 Gbps (iptables). (3) CPU overhead: 5-15% (eBPF) vs 30-50% (iptables) at 10 Gbps. (4) Scalability: Handles 100K services (eBPF) vs 5K services (iptables before degradation). (5) Pod startup: 100ms (eBPF) vs 10+ seconds (iptables with 5K+ services due to rule insertion). Cilium deployment modes (2025): (1) CNI chaining: Cilium alongside existing CNI (AWS VPC CNI, Azure CNI) - provides policy enforcement, observability without replacing datapath. (2) Native routing: Cilium as primary CNI with direct routing (BGP, static routes) - best performance, requires L3 network fabric. (3) Overlay mode: VXLAN/Geneve tunneling for clusters without BGP support (GKE, AKS). (4) Kube-proxy replacement: Cilium replaces kube-proxy entirely (enable via helm: kubeProxyReplacement: strict) - reduces resource usage, improves performance. Advanced eBPF features in Cilium: (1) XDP (eXpress Data Path): Earliest hook in network stack (NIC driver level), enables DDoS mitigation (drop malicious packets before kernel processing), load balancing at line rate (40 Gbps+), packet sampling for monitoring. (2) Socket-level LB: Intercepts application connect() syscalls, rewrites destination to backend pod IP before packet sent (saves kernel network stack traversal). (3) BPF maps: Shared state between eBPF programs and userspace - connection tracking (conntrack), endpoint info (pod IP → identity), policy rules, service backends. LRU eviction for scalability. (4) Tail calls: Chain multiple eBPF programs (split complex logic into modules), work around 4096 instruction limit per program. Production use cases: (1) Replace kube-proxy: Improve service latency 5-10x, support 10K+ services without degradation. (2) NetworkPolicy enforcement: L3/L4/L7 policies with better performance than Calico iptables mode. (3) Service mesh dataplane: Sidecar-less mesh with eBPF-accelerated Envoy (Cilium Service Mesh). (4) Multi-cluster networking: Cluster mesh with eBPF-based routing across clusters. (5) Observability: Hubble for network/security visibility without instrumentation. Requirements: Linux kernel 4.19+ (eBPF support), 5.10+ recommended (eBPF improvements), BPF filesystem mounted (/sys/fs/bpf), BTF (BPF Type Format) support for CO-RE (Compile Once Run Everywhere). Cloud provider support (2025): AWS EKS (Cilium CNI GA), GCP GKE (Cilium dataplane GA), Azure AKS (Cilium CNI preview), self-managed clusters (kubeadm, kops, Rancher). Comparison with traditional networking: (1) iptables: Userspace rule management, kernel netfilter hooks (slower), O(n) rule traversal, 5K service limit. (2) IPVS: Kernel-level LB (faster than iptables), limited to L4, no L7 visibility, separate NetworkPolicy solution needed. 
(3) Cilium eBPF: Kernel-level everything (routing, LB, policy, encryption), O(1) lookups via BPF maps, L3-L7 visibility, 100K+ service scale.
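A hedged Helm values sketch for the Cilium features discussed above (kube-proxy replacement, Hubble, transparent encryption); exact keys vary by Cilium chart version, and the API server host is a placeholder:

```yaml
# values.yaml sketch for 'helm install cilium cilium/cilium -n kube-system -f values.yaml'
kubeProxyReplacement: strict      # per the answer; newer charts accept true/false instead
k8sServiceHost: <api-server-host> # required once kube-proxy is removed (placeholder)
k8sServicePort: 6443
hubble:
  enabled: true
  relay:
    enabled: true                 # aggregate flows cluster-wide
  ui:
    enabled: true                 # Hubble UI service map
encryption:
  enabled: true
  type: wireguard                 # transparent pod-to-pod encryption
```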
Cilium Service Mesh architecture (sidecar-less, 2025): Revolutionary approach using eBPF + shared Envoy proxies per node instead of per-pod sidecars, reducing resource overhead 50-80% while maintaining L7 features (mTLS, traffic management, observability). Architectural comparison: (1) Cilium Service Mesh (sidecar-less): Single Envoy proxy per node (shared across all pods on node), L3/L4 traffic handled entirely in kernel via eBPF (no proxy), L7 traffic (HTTP, gRPC requiring header inspection, mTLS termination, retries) redirected to node-level Envoy via eBPF socket redirection. Typical 100-pod cluster: 3 Envoy proxies (3 nodes) consuming 300MB total memory. Traffic flow: Pod A → eBPF routing (L3/L4) → Pod B (same node, kernel-only, sub-100us latency), Pod A → eBPF → node Envoy (L7 processing) → eBPF → Pod B on remote node (one proxy hop vs two in sidecar model). (2) Istio (sidecar per pod): Envoy sidecar injected into every pod (istio-proxy container), all traffic (L3/L4/L7) routed through sidecars, pod → sidecar → remote sidecar → pod (two proxy hops, 2-5ms added latency). Typical 100-pod cluster: 100 Envoy sidecars consuming 5-10GB memory (50-100MB per sidecar). Control plane: istiod manages 100+ sidecar configurations. (3) Linkerd (sidecar per pod): Linkerd2-proxy (Rust-based) sidecar in every pod, lighter than Envoy (10-20MB per sidecar vs 50-100MB), simplified feature set (no Lua, no advanced routing), same two-hop architecture. Typical 100-pod cluster: 100 proxies consuming 1-2GB memory. Resource overhead (production benchmarks, 2025): (1) Memory: Cilium: 100MB per node (shared Envoy) = 300MB for 3-node cluster. Istio: 50MB per pod × 100 pods = 5GB. Linkerd: 15MB per pod × 100 pods = 1.5GB. Cilium uses 94% less memory than Istio, 80% less than Linkerd. (2) CPU: Cilium: 1-5% per node (eBPF + shared Envoy) = 0.1-0.2 cores for 100 pods. Istio: 20-30% overhead per pod (sidecar + iptables) = 20-30 cores for 100 pods. Linkerd: 10-15% per pod = 10-15 cores for 100 pods. Cilium reduces CPU usage 90-95% vs Istio, 85-90% vs Linkerd. (3) Network latency: Cilium L3/L4: 50-100us (eBPF kernel routing), L7: 1-2ms (one Envoy hop). Istio: 2-5ms (two Envoy hops + iptables). Linkerd: 1-3ms (two lightweight proxy hops). Cilium L7 latency 40-60% lower than Istio. Traffic routing differences: (1) Cilium eBPF-based routing: L3/L4 traffic (simple TCP connections, no HTTP inspection needed) stays in kernel - eBPF programs intercept packets at socket layer, rewrite destination IP to backend pod, forward directly (zero userspace hops). L7 traffic requiring inspection (HTTP path-based routing, header manipulation, mTLS termination) redirected to node Envoy via eBPF socket redirection (sock_ops programs intercept connect/accept syscalls). One proxy hop: source pod → node Envoy → destination pod. (2) Istio/Linkerd iptables-based routing: All traffic redirected through sidecars via iptables REDIRECT rules (every packet traverses iptables chains), two proxy hops: source pod → source sidecar → destination sidecar → destination pod. Cannot bypass sidecars for simple L3/L4 (all traffic taxed). mTLS and security: (1) Cilium: mTLS termination at node-level Envoy (shared TLS certificates per node, managed by Cilium agent), identity-based policies via eBPF (pod labels → SPIFFE IDs), transparent encryption option: IPsec/WireGuard in kernel (eBPF-accelerated, 8-9 Gbps encrypted throughput). 
(2) Istio: mTLS termination at each sidecar (per-pod certificates from Citadel CA, SPIFFE-compliant), workload identity via SPIRE, certificate rotation every 24 hours. (3) Linkerd: mTLS at each sidecar (per-pod certificates from linkerd-identity service), automatic rotation, simplified trust model. All three provide zero-trust security, but Cilium's node-level approach reduces certificate management overhead (3 certs vs 100 certs for 100 pods). Observability: (1) Cilium + Hubble: Zero-instrumentation observability via eBPF (captures all network events in kernel, no sidecars/agents needed), sees L3/L4/L7 flows (DNS, TCP, HTTP, gRPC) without application changes, Hubble UI for real-time service maps, 1-5% CPU overhead for full observability. Sees traffic between Istio sidecars (kernel-level visibility includes sidecar-to-sidecar). (2) Istio: Telemetry from Envoy sidecars (access logs, metrics, traces), exports to Prometheus/Jaeger/Zipkin, kiali for service graph visualization, 5-10% additional CPU overhead for full telemetry. (3) Linkerd: Telemetry from linkerd-proxy sidecars (metrics, tap API for live traffic inspection), Linkerd Viz dashboard, lighter overhead (3-5% CPU) due to simpler proxy. Feature comparison (L7 capabilities): (1) Cilium: HTTP routing (path, header, method matching), gRPC routing, traffic splitting (canary, A/B testing), circuit breaking, retries/timeouts, fault injection, request mirroring. Powered by Envoy (inherits Envoy features). Limitations: Fewer advanced features than Istio (no Wasm plugins, no Lua filters). (2) Istio: Full Envoy feature set - advanced routing (regex, query params), Wasm plugins for custom logic, Lua scripting, rate limiting, external authorization, JWT validation, fault injection, traffic mirroring. Most feature-rich service mesh. (3) Linkerd: Simplified feature set - HTTP routing (path/header matching), traffic splitting, retries/timeouts, circuit breaking. No scripting/plugins (simpler = easier to operate, fewer CVEs). Use case recommendations (2025): (1) Cilium Service Mesh: Best for resource-constrained environments (edge clusters, IoT), large-scale deployments (1000+ pods, cost-sensitive), L3/L4 heavy workloads (databases, caches, message queues where mTLS not required), clusters already using Cilium CNI (upgrade to mesh without adding sidecars). (2) Istio: Best for complex L7 requirements (advanced routing, custom plugins, Wasm extensions), multi-cluster/multi-cloud meshes (Istio's strong suit), enterprises needing mature ecosystem (extensive tooling, commercial support). (3) Linkerd: Best for simplicity and security (small teams, fewer features = less complexity), Rust performance benefits (lower latency than Envoy for simple proxying), quick service mesh adoption (easier learning curve). Performance summary (100-pod cluster): Cilium: 300MB memory, 0.1-0.2 cores CPU, 50-100us L3/L4 latency, 1-2ms L7 latency. Istio: 5GB memory, 20-30 cores CPU, 2-5ms latency. Linkerd: 1.5GB memory, 10-15 cores CPU, 1-3ms latency. Cilium wins on resource efficiency, Istio on features, Linkerd on simplicity. Deployment complexity: (1) Cilium: Single component (Cilium agent DaemonSet includes Envoy management), simpler upgrades (no per-pod sidecar injection), fewer moving parts. Requires Cilium CNI (migration if using different CNI). (2) Istio: Complex deployment (istiod control plane, gateways, sidecars), sidecar injection via mutating webhook (automatic or manual), version skew management (control plane vs data plane). 
(3) Linkerd: Simple architecture (linkerd-control-plane, linkerd-proxy sidecars), easy installation via CLI (linkerd install | kubectl apply), straightforward upgrades. Migration path: Existing Istio/Linkerd users can migrate to Cilium Service Mesh for resource savings, but lose advanced L7 features (Wasm, Lua). Gradual approach: Use Cilium CNI + kube-proxy replacement first (networking layer), enable Cilium Service Mesh later (L7 layer), measure resource savings vs feature trade-offs. Cloud provider support (2025): AWS EKS (Cilium Service Mesh preview), GCP GKE (Cilium dataplane + mesh beta), Azure AKS (Cilium CNI, mesh experimental). Istio/Linkerd available on all providers.
Hubble: Zero-instrumentation network observability platform built into Cilium (eBPF-based Kubernetes networking). Architecture: Runs in Linux kernel via eBPF capturing all network/system activity without requiring application instrumentation, sidecars, or code changes. Observability capabilities (2025): (1) Service dependency maps: Visualize pod-to-pod communication, microservice dependencies, external service calls - auto-discovered in real-time. (2) Network flow logs: L3/L4/L7 flow data with source/destination IPs, ports, protocols, HTTP methods, gRPC calls. (3) DNS query monitoring: Track all DNS lookups (internal/external), detect DNS tunneling, monitor query latency. (4) HTTP/gRPC request tracing: Capture method, path, status code, latency without service mesh. (5) TCP connection tracking: SYN/ACK handshakes, connection states, retransmissions, connection lifetimes. (6) Packet drop analysis: Identify drops by policy (NetworkPolicy denies), capacity (buffer overflow), errors (checksum failures) with exact drop reasons. (7) TLS/mTLS visibility: Track encrypted connections, certificate usage (with Cilium 1.14+). Hubble UI: Web-based real-time visualization showing service graph with traffic flows, latency heatmaps, network policies applied. Hubble CLI: hubble observe command for log streaming (--pod, --namespace filters), hubble status for health checks. Performance (2025 benchmarks): 1-15% CPU overhead depending on traffic volume (5% typical for production workloads), sub-millisecond latency impact, scales to 10K+ pods per cluster. Deployment: (1) Hubble Relay: aggregates data from all cluster nodes, exposes gRPC API for CLI/UI. (2) Hubble UI: web dashboard deployed as Kubernetes service. (3) Prometheus metrics: exposes flow metrics for Grafana dashboards. Advantages vs service mesh observability (Istio, Linkerd): (1) No per-pod sidecars (reduces resource overhead 50%+), (2) Kernel-level visibility (sees all traffic including sidecar-to-sidecar), (3) Lower latency (no proxy hops), (4) Captures non-HTTP traffic (gRPC, DNS, TCP). Production use cases: Troubleshoot intermittent network issues (packet drops, DNS failures), validate NetworkPolicy enforcement (security auditing), detect anomalous traffic patterns (security monitoring), capacity planning (identify high-traffic services), compliance (log all service-to-service communication). Integration: Works with Prometheus (metrics export), Grafana (dashboards), FluentBit (log forwarding), OpenTelemetry (distributed tracing).
Falco (runtime security for cloud-native environments): Open-source threat detection engine providing real-time security monitoring for containers, Kubernetes, Linux hosts, and cloud workloads by observing kernel-level system activity. CNCF journey: Originally created by Sysdig (2016), donated to CNCF October 2018 as sandbox project, promoted to incubating January 2020, graduated to CNCF graduated project February 29, 2024 - joining elite group with Kubernetes, Prometheus, Envoy, containerd. First runtime security project to reach CNCF graduated status. Architecture: Kernel-level monitoring via eBPF probe (modern, preferred) or kernel module (legacy) intercepting all syscalls (open, execve, connect, write, unlink, etc.) from every process on host. Events enriched with container/Kubernetes metadata (pod name, namespace, labels, image, service account) via libsinsp library. Rule engine evaluates events against security policies (YAML rules), generates alerts when threats detected. Detection capabilities (2025): (1) Container escapes: Detects attempts to break out of container isolation (mounting host filesystem, accessing /proc/self/exe, exploiting container runtime vulnerabilities like runC CVE-2019-5736). (2) Privilege escalation: Identifies privilege escalation attempts (setuid binaries execution, capabilities manipulation, sudo/su abuse, kernel exploit attempts). (3) Suspicious file operations: Monitors unauthorized file access (reading /etc/shadow, modifying SSH keys in .ssh/authorized_keys, writing to /etc/passwd, tampering with logs in /var/log). (4) Network anomalies: Detects unexpected network activity (reverse shells via bash -i > /dev/tcp/, cryptocurrency mining connections to pool servers, C2 communication, port scanning from containers). (5) Malicious binary execution: Identifies execution of known attack tools (netcat, nmap, metasploit, mimikatz), cryptocurrency miners (xmrig, ethminer), shell spawning in non-interactive containers (exec /bin/bash in nginx pod). (6) Credential theft: Monitors access to credential files (Kubernetes service account tokens in /var/run/secrets, AWS credentials in .aws/credentials, Docker socket access in /var/run/docker.sock). (7) Kubernetes-specific threats: API server attacks (unauthorized kubectl exec), privileged pod creation (hostPath mounts, privileged: true), secret access anomalies. Deployment in Kubernetes: DaemonSet ensuring Falco pod runs on every cluster node (guarantees cluster-wide monitoring), hostPID: true and privileged: true for kernel-level access. Supports multiple outputs: stdout (logs), file, syslog, HTTP webhook, gRPC. Integration with alerting: Falcosidekick routes alerts to Slack, PagerDuty, Elasticsearch, AWS Security Hub, Splunk, Datadog. Rule system: Over 100 default rules in falco-rules.yaml covering MITRE ATT&CK techniques, community rules for specific frameworks (Apache, NGINX, PostgreSQL, Redis), custom rules via YAML (condition: syscall and container and rule logic). Priority levels: Emergency, Alert, Critical, Error, Warning, Notice, Informational, Debug. Example rule: Alert on shell spawning in container (rule: Terminal shell in container, condition: spawned_process and container and shell_procs and proc.tty != 0, output: Shell spawned in container (user=%user.name container_id=%container.id image=%container.image.repository), priority: WARNING). Performance: 1-5% CPU overhead per node (eBPF mode, typical workload), 200-500MB memory per node, sub-millisecond syscall monitoring latency. 
Scales to 1000+ node clusters. Driver options (2025): (1) Modern eBPF probe (default, recommended): CO-RE (Compile Once Run Everywhere) with BTF, no kernel headers required, works across kernel versions 5.8+, fastest performance. (2) Legacy eBPF probe: Requires kernel headers for compilation, wider kernel compatibility (4.14+). (3) Kernel module: Oldest method, requires insmod permissions, restricted in managed Kubernetes (GKE Autopilot, Fargate). Modern eBPF preferred (60% lower overhead, better security). CNCF maturity criteria met (graduation requirements): (1) Production use by 3+ end users with large-scale deployments (validated), (2) Committers from 2+ organizations (Sysdig, IBM, Shopify, others), (3) Clear versioning and release process (semantic versioning, quarterly releases), (4) Security audits completed (CNCF-funded audit, CVE response process), (5) Documented governance model (steering committee, contributor ladder). Production adoption (2025): Used by large enterprises (banks, telecoms, SaaS providers) for compliance (PCI-DSS, SOC 2, HIPAA), runtime threat detection, incident response. Integrates with Kubernetes security platforms (Sysdig Secure, Aqua Security, Prisma Cloud). Community activity: 500+ contributors, 6000+ GitHub stars, monthly releases (Falco 0.37 as of 2025), active Slack community (falco-community.slack.com). Advantages: (1) Kernel-level visibility: Sees all syscalls, cannot be bypassed by malware (unlike user-space agents). (2) Zero application changes: No instrumentation, sidecars, or code modifications needed. (3) Real-time detection: Sub-second alert latency from threat to notification. (4) Cloud-native aware: Understands Kubernetes concepts (pods, namespaces, services), enriches alerts with cluster context. (5) Extensible: Custom rules, plugin system (outputs, data sources), gRPC API for integration. Use cases: (1) Runtime threat detection: Alert on active attacks (container escape attempts, reverse shells, crypto mining). (2) Compliance auditing: Log all file access, process execution, network connections for audit trails (who accessed what, when). (3) Incident response: Forensic data for post-breach analysis (what did attacker execute, which files accessed). (4) Security policy enforcement: Detect policy violations (unauthorized kubectl exec, privileged pod creation). Comparison with other tools: (1) Falco vs Tetragon (Cilium): Both eBPF-based, Tetragon focuses on policy enforcement (kill processes), Falco on detection/alerting. Complementary. (2) Falco vs Sysdig Secure: Sysdig Secure is commercial platform built on Falco (adds UI, compliance reporting, forensics), Falco is OSS core engine. (3) Falco vs traditional SIEM: Falco provides real-time kernel-level data, SIEM aggregates logs from multiple sources. Falco outputs to SIEM for centralized analysis. Getting started: Install via Helm (helm install falco falcosecurity/falco --set driver.kind=modern_ebpf), view alerts (kubectl logs -n falco -l app.kubernetes.io/name=falco), test with rule triggering (kubectl exec into pod and run cat /etc/shadow).
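The 'Terminal shell in container' rule described above, expressed as a Falco rules-file entry; the desc text and tags are illustrative:

```yaml
- rule: Terminal shell in container
  desc: A shell was spawned with an attached terminal inside a container   # illustrative description
  condition: spawned_process and container and shell_procs and proc.tty != 0
  output: >
    Shell spawned in container (user=%user.name container_id=%container.id
    image=%container.image.repository)
  priority: WARNING
  tags: [container, shell, mitre_execution]    # illustrative MITRE-style tagging
```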
Falco's eBPF-based threat detection architecture: Deep kernel integration intercepts all system calls (syscalls) from every process on Linux host, enriches events with container/Kubernetes context, evaluates against security rules, generates real-time alerts for malicious activity. eBPF probe mechanism: (1) Syscall interception: eBPF programs attached to kernel tracepoints (raw_syscalls:sys_enter, raw_syscalls:sys_exit) capture every syscall invocation before kernel processes it. Monitored syscalls include: open/openat (file access), execve/execveat (process execution), connect/accept/bind (network connections), write/read (I/O operations), unlink/rmdir (file deletion), setuid/setgid (privilege changes), clone/fork (process creation), mount/umount (filesystem operations), ptrace (process debugging/injection), socket operations (network activity). 300+ syscalls monitored in total. (2) Event enrichment: libsinsp library (part of Falco) enriches raw syscall events with container metadata (container ID, image name, image registry, pod name, namespace, labels, annotations from Kubernetes API), process context (PID, PPID, command line, user, group, working directory, environment variables), file metadata (path, permissions, owner, inode), network metadata (source/destination IP, ports, protocol). Example: Raw syscall execve(/bin/sh) enriched to Process /bin/sh spawned in container nginx-abc123, pod: web-server, namespace: production, image: nginx:1.25, user: root. (3) Rule evaluation engine: Falco rules (YAML format) define security conditions combining syscalls + metadata filters. Rule structure: rule (name), condition (logical expression with syscall filters, container checks, metadata matching), output (alert message template with variables), priority (severity level), tags (MITRE ATT&CK mappings). Example rule for crypto mining detection: condition: spawned_process and container and proc.name in (xmrig, ethminer, minerd) - triggers when mining binary executed in container. (4) Alert generation: When rule condition matches, Falco generates alert with enriched context (timestamp, rule name, priority, output message with interpolated variables, container/pod/namespace info, full syscall details). Alerts routed to configured outputs (stdout, file, syslog, HTTP webhook, gRPC). Falcosidekick receives alerts via gRPC/webhook, routes to 50+ destinations (Slack, PagerDuty, AWS Security Hub, Elasticsearch, Splunk, Datadog, Prometheus). Real-time threat detection examples: (1) Cryptocurrency mining: Falco detects execution of known miners (xmrig, ethminer, cpuminer) via execve syscall monitoring. Rule: spawned_process and proc.name in miner_binaries triggers immediate alert. Also detects mining pool connections via connect syscall to known pool IPs/domains (rule: outbound_connection and fd.sip in mining_pool_ips). Response: Kill pod, block image in admission controller. (2) File exfiltration: Monitors write syscalls to network sockets (detecting large file uploads), read syscalls on sensitive files followed by network activity (combining file access + network syscalls in temporal correlation). Example: Reading /etc/shadow then opening network connection triggers alert. Also detects compression before exfil (tar/gzip execution + network activity). 
(3) Privilege escalation: Detects setuid/setgid syscalls (changing user/group IDs), execution of suid binaries (sudo, su, passwd) from unexpected contexts (web server container running sudo), capability manipulation (capset syscall), kernel exploits (unusual ptrace patterns, /proc/self/mem access). Rule: process attempted privilege escalation (setuid to root from non-root process in container). (4) Rootkit installation: Monitors kernel module loading (finit_module syscall), /dev/kmem or /proc/kcore access (kernel memory manipulation), suspicious eBPF program loading (bpf syscall from unprivileged context), modification of system binaries in /bin, /usr/bin (write syscalls to trusted paths). (5) Container escape attempts: Detects mounting host filesystem (mount syscall with /host, /mnt), accessing container runtime socket (connect to /var/run/docker.sock), exploiting runC vulnerabilities (accessing /proc/self/exe, overwriting runc binary), privilege escalation to host namespace (unshare/setns syscalls). Rule: Host filesystem mounted in container (mount and container and fd.name startswith /host). (6) Reverse shell detection: Monitors shell process spawning (execve /bin/bash, /bin/sh) with network file descriptors (stdin/stdout redirected to socket), netcat/socat execution with listening ports, bash with TCP connections (bash -i > /dev/tcp). Rule: spawned_process and shell_procs and proc.pname in (netcat, nc, ncat) or fd.typechar=4 (IPv4 socket). (7) Unauthorized kubectl exec: Detects shell spawning in containers without TTY (proc.tty = 0 indicates kubectl exec without -t flag), interactive shells in non-interactive containers (nginx, redis spawning /bin/bash), unexpected user executing commands (non-root user in privileged container). Advantages over observability tools (Hubble, Prometheus): (1) Security-focused rules: Falco rules encode attack patterns (MITRE ATT&CK techniques), not just generic network flows. Example: Hubble shows connection from pod A to pod B, Falco identifies connection is reverse shell attempt. (2) Syscall-level granularity: Sees exact file paths accessed, command-line arguments, environment variables (deeper than network-only visibility). Example: Detects reading /etc/shadow (file-level), not just network exfiltration (packet-level). (3) Anomaly detection: Detects behavioral anomalies (nginx spawning /bin/bash = suspicious), not just traffic patterns. (4) Threat intelligence integration: Rules updated with latest CVEs, malware signatures, attack techniques (community-driven threat intel). Performance characteristics (2025 benchmarks): (1) CPU overhead: 1-3% per node (eBPF mode, typical workload with 50-100 pods), 3-5% for high-traffic nodes (200+ pods, 100K+ syscalls/sec). Significantly lower than kernel module mode (5-10% overhead). (2) Memory: 200-300MB per node (eBPF buffers, rule engine, enrichment cache), scales linearly with rule count (100 rules = 200MB, 500 rules = 400MB). (3) Latency: Sub-millisecond syscall interception (eBPF programs execute in <50 microseconds), 1-5ms rule evaluation (depends on rule complexity), 10-50ms alert generation (enrichment + output formatting). Total threat-to-alert latency: <100ms (real-time). (4) Scalability: Handles 100K+ syscalls/sec per node (high-traffic web servers), 1000+ pods per cluster (DaemonSet architecture distributes load), 500+ concurrent rules (optimized rule matching). 
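To make the rule format described above concrete, here is a minimal sketch of a custom detection rule in Falco's YAML syntax; the miner_binaries list, rule name, and output fields are illustrative and would normally be tuned per environment:

  - list: miner_binaries
    items: [xmrig, ethminer, minerd]

  - rule: Crypto Miner Launched in Container
    desc: Detect execution of a known cryptocurrency miner inside a container
    condition: spawned_process and container and proc.name in (miner_binaries)
    output: >
      Crypto miner started (command=%proc.cmdline user=%user.name
      container=%container.name image=%container.image.repository
      pod=%k8s.pod.name ns=%k8s.ns.name)
    priority: CRITICAL
    tags: [mitre_impact, T1496]

The condition reuses the stock spawned_process and container macros, so the rule only fires for process launches inside containers, and the output template interpolates the enrichment fields (pod, namespace, image) that libsinsp attaches to each event.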
Deployment best practices (DaemonSet configuration): (1) Node coverage: DaemonSet ensures Falco on every node (no blind spots), tolerates all taints (runs on tainted nodes), nodeSelector for specific node pools (separate rules for prod vs dev). (2) Privileges required: hostPID: true (see all processes), privileged: true (load eBPF programs), volumes: /var/run/docker.sock (container metadata), /proc (process info), /boot (kernel headers for eBPF compilation if needed). (3) Resource limits: CPU: request 100m, limit 1000m (burst for high-traffic nodes), memory: request 512Mi, limit 1Gi (rule engine + buffers). (4) Driver selection: Modern eBPF probe preferred (--set driver.kind=modern_ebpf in Helm), kernel module fallback for old kernels (<4.14), legacy eBPF for compatibility (4.14-5.7). Rule customization (production patterns): (1) Baseline tuning: Disable noisy rules (shell spawning in dev namespace), adjust severity (lower priority for read-only file access), add exceptions (allow legitimate admin tools). (2) Custom rules: Define organization-specific threats (access to proprietary /app/secrets, execution of banned tools, connections to blacklisted IPs). (3) MITRE ATT&CK mapping: Tag rules with MITRE techniques (TA0004 Privilege Escalation, T1068 Exploitation for Privilege Escalation), correlate alerts with attack chains. Integration with incident response (2025 workflows): (1) Automated response: Falco alert → Falcosidekick → AWS Lambda/Cloud Function → Delete pod (kubectl delete pod), block image (OPA policy update), isolate namespace (NetworkPolicy deny-all). (2) SIEM integration: Alerts to Elasticsearch/Splunk for correlation with other security events (combine Falco container escape + IDS network scan = coordinated attack). (3) Forensics: Sysdig Inspect captures full syscall trace for post-incident analysis (replay attack sequence, identify patient zero). Comparison with other runtime security tools: (1) Falco vs Aqua Tracee: Both eBPF-based, Tracee focuses on event streaming (sends all events to analytics backend), Falco on rule-based detection (local evaluation, selective alerting). Falco lower overhead (filters at source). (2) Falco vs Tetragon (Cilium): Tetragon enforces policies (kills processes, blocks syscalls via eBPF), Falco detects and alerts (passive monitoring). Complementary: Falco detects, Tetragon prevents. (3) Falco vs AppArmor/SELinux: Mandatory Access Control (MAC) systems prevent unauthorized actions (enforces policies), Falco detects policy violations (monitoring). Use together: SELinux enforces, Falco alerts on violations. Getting started example: Deploy Falco (helm install falco falcosecurity/falco --set driver.kind=modern_ebpf --set falcosidekick.enabled=true --set falcosidekick.webui.enabled=true), trigger test alert (kubectl exec into pod, run cat /etc/shadow), view alert in Falcosidekick UI or logs (kubectl logs -n falco -l app.kubernetes.io/name=falco).
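A sketch of Helm values capturing these practices (driver selection, resource limits, toleration of all taints, Falcosidekick routing, and a custom rule); the keys follow the commonly documented layout of the falcosecurity/falco chart, so treat them as assumptions to verify against the chart version you deploy:

  # values.yaml (key names assumed from the falcosecurity/falco chart documentation)
  driver:
    kind: modern_ebpf            # preferred driver on kernels 5.8+
  tolerations:
    - operator: Exists           # schedule onto tainted nodes too, avoiding blind spots
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  falcosidekick:
    enabled: true                # route alerts to Slack/PagerDuty/SIEM
  customRules:
    org-rules.yaml: |-
      - rule: Read Proprietary App Secrets
        desc: Detect reads under the application's secret directory
        condition: open_read and container and fd.name startswith /app/secrets
        output: Secret file opened (file=%fd.name pod=%k8s.pod.name ns=%k8s.ns.name)
        priority: WARNING
        tags: [custom]

Applied with something like 'helm upgrade --install falco falcosecurity/falco -n falco --create-namespace -f values.yaml', this keeps rule customizations in version control instead of being edited in place on the cluster.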
Falco Kubernetes deployment (2025 best practices): Primary deployment: DaemonSet (recommended for production): Ensures Falco runs on every cluster node, automatically scales when nodes added/removed. DaemonSet pods monitor all containers/processes on their host node. Alternative: Deployment (plugin-based ingestion): Single-replica Deployment for centralized plugin-based event sources (cloud logs, Kubernetes audit logs from API server, third-party event streams). Not for syscall monitoring (requires node-level access). Driver options (syscall collection methods): (1) Modern eBPF probe (2025 default, recommended): Uses eBPF CO-RE (Compile Once, Run Everywhere) with BTF (BPF Type Format) - no kernel headers needed, works across kernel versions (5.8+), dynamic runtime injection, no kernel module compilation. Best compatibility. (2) eBPF probe (legacy): Requires kernel headers for compilation, more restrictive but still no kernel module. (3) Kernel module (falco.ko): Oldest method, requires kernel module loading (insmod), restricted in hardened environments (GKE Autopilot, Fargate), higher performance overhead. Modern eBPF preferred (60% faster than kernel module). Helm chart installation (official method): helm repo add falcosecurity https://falcosecurity.github.io/charts && helm install falco falcosecurity/falco --set driver.kind=modern_ebpf --set tty=true. Driver selection factors: Modern eBPF: works on GKE, EKS, and AKS nodes where kernel modules aren't allowed, automatic kernel compatibility. Kernel module: legacy systems, air-gapped environments with pre-built modules. Node-level privileges required: DaemonSet runs with hostPID, hostNetwork, privileged securityContext (access to /proc, /sys, kernel events). Resource footprint (2025 benchmarks): CPU: 50-150m per node (roughly 5-15% of a single core), Memory: 200-500MB per node, Network: minimal (metrics export only). High availability: DaemonSet ensures fault tolerance - node failure doesn't impact other nodes' monitoring. Scaling: Automatically scales with cluster (1 Falco pod per node), supports clusters with 1000+ nodes. Production configuration: Enable gRPC output (consumed by falco-exporter), integrate with Falcosidekick (alert routing to Slack/PagerDuty/SIEM), custom Falco rules in ConfigMap, Prometheus metrics scraping.
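For the Falcosidekick integration mentioned above, a hedged sketch of wiring alerts to Slack purely through chart flags (the falcosidekick.config.slack.webhookurl key follows the Falcosidekick chart's documented layout and the webhook URL is a placeholder; in production prefer referencing a Secret over passing the URL on the command line):

  helm upgrade --install falco falcosecurity/falco \
    --namespace falco --create-namespace \
    --set driver.kind=modern_ebpf \
    --set falcosidekick.enabled=true \
    --set falcosidekick.webui.enabled=true \
    --set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/XXX/YYY/ZZZ"

  # Confirm one Falco pod per node and that Falcosidekick is receiving events
  kubectl get daemonset,pods -n falco -o wide
  kubectl logs -n falco -l app.kubernetes.io/name=falcosidekick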