API Security Observability FAQ & Answers
25 expert API Security Observability answers researched from official documentation. Every answer cites authoritative sources you can verify.
observability (11 questions)
The three pillars: Logs - discrete events with timestamps, structured or unstructured text describing what happened. Metrics - numeric measurements aggregated over time (counters, gauges, histograms). Traces - end-to-end request paths across distributed services showing timing and causality. Modern addition: some consider Profiles (CPU/memory profiling) a fourth pillar. Pillar relationships: Logs tell you WHAT happened, Metrics tell you HOW MUCH/HOW OFTEN, Traces tell you WHERE in the system and HOW LONG. Correlation is key: link logs/metrics/traces with common identifiers (trace_id, request_id) for full context. OpenTelemetry unifies collection of all three pillars with consistent semantics and correlation.
OpenTelemetry (OTel) is a CNCF project providing vendor-neutral APIs, SDKs, and tools for generating and collecting telemetry data (traces, metrics, logs). Problem solved: Before OTel, each observability vendor or project (Datadog, New Relic, Jaeger, Prometheus) had its own instrumentation. Switching vendors meant re-instrumenting your entire codebase. OTel solution: Instrument once with OTel, export to any backend. Components: API (interfaces for instrumentation), SDK (implementation), Collector (receives, processes, exports data), OTLP (wire protocol). Status (2024): Traces, Metrics, and Logs are stable; Profiling is experimental. Supported languages: Java, Python, Go, JavaScript, .NET, Ruby, PHP, Rust, C++, Swift, Erlang.
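A minimal sketch of instrumenting a service with the OpenTelemetry Python SDK, assuming the opentelemetry-sdk package; the service name and attributes are placeholders, and ConsoleSpanExporter stands in for a real backend exporter.

```python
# Minimal tracing setup with the OpenTelemetry Python SDK (pip install opentelemetry-sdk).
# ConsoleSpanExporter is used here for illustration; swap in an OTLP exporter for a real backend.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-api")

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.request.method", "GET")
    # ... application logic ...
```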
Trace: represents the entire journey of a request through a distributed system, from initial entry to final response. Identified by a unique trace_id that propagates across all services. Span: represents a single unit of work within a trace - one operation, one service call, one database query. Each span has: span_id (unique identifier), trace_id (parent trace), parent_span_id (calling span), name, start/end timestamps, attributes (key-value metadata), events (timestamped logs), status (ok/error). Span relationships: parent-child hierarchy forms a tree/DAG. Root span has no parent. Example trace: HTTP request (root span) -> auth service (child) -> database query (grandchild). Visualization: traces render as waterfall/Gantt charts showing timing and dependencies.
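A short hedged sketch of how spans nest into the parent-child tree described above, using the OpenTelemetry Python API; span names are illustrative.

```python
# Parent/child spans: each start_as_current_span() opened inside another span
# becomes its child, forming the trace tree rendered as a waterfall.
from opentelemetry import trace

tracer = trace.get_tracer("example")

with tracer.start_as_current_span("GET /orders/{id}"):              # root span
    with tracer.start_as_current_span("auth.check_token"):          # child
        pass  # call the auth service
    with tracer.start_as_current_span("db.query_order") as db_span:  # child
        db_span.set_attribute("db.system", "postgresql")             # grandchild work, e.g. a query
```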
The OTel Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data. Architecture: Receivers (ingest data - OTLP, Jaeger, Prometheus, etc.), Processors (transform data - batching, filtering, sampling, attribute modification), Exporters (send to backends - OTLP, Jaeger, Prometheus, vendor-specific). Deployment modes: Agent (sidecar/daemonset per host), Gateway (centralized cluster). When to use Collector vs direct export: Use Collector for: multiple export destinations, data transformation/enrichment, retry/buffering, reducing SDK complexity, credential management (SDKs don't need backend credentials). Skip Collector for: simple single-destination setups, edge/constrained environments. Resource overhead: typically 100-200MB RAM, 0.5-1 CPU core for moderate traffic.
OTLP is the native wire protocol for OpenTelemetry, designed for transmitting traces, metrics, and logs. Transports: gRPC (default, port 4317) and HTTP/1.1 (port 4318, paths: /v1/traces, /v1/metrics, /v1/logs). Encoding: Protocol Buffers (binary) for efficiency, JSON for debugging. Key features: supports all signal types in single protocol, built-in compression (gzip), designed for streaming, acknowledgment-based for reliability. Endpoints: Collector default gRPC: localhost:4317, HTTP: localhost:4318. Status: OTLP is stable for traces, metrics, and logs as of 2024. Why OTLP over vendor protocols: universal receiver support, future-proof, can switch backends without changing export code. Most backends now natively support OTLP ingestion (Datadog, New Relic, Honeycomb, Grafana Cloud).
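A hedged sketch of exporting spans over OTLP/gRPC to a local Collector, assuming the opentelemetry-exporter-otlp-proto-grpc package; the endpoint matches the default port noted above.

```python
# Export spans via OTLP/gRPC to a Collector on the default endpoint (localhost:4317).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)  # insecure=True for a local, non-TLS Collector
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```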
W3C Trace Context is a standard for propagating trace correlation identifiers across service boundaries via HTTP headers. Two headers: traceparent (required) - format: {version}-{trace-id}-{parent-id}-{trace-flags}, example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01. tracestate (optional) - vendor-specific key-value pairs for additional context. Importance: without standard, each vendor used proprietary headers (X-B3-TraceId for Zipkin, X-Datadog-Trace-Id for Datadog). Interoperability was impossible. W3C Trace Context enables: cross-vendor tracing, mixed-environment debugging, standard library/framework support. Status: Level 2 published March 2024 (W3C Recommendation). OpenTelemetry uses W3C Trace Context by default. All major platforms support it: AWS X-Ray, Azure Monitor, GCP Cloud Trace, Datadog, New Relic.
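A tiny illustrative sketch of splitting a traceparent header into its four fields, using the example value above; purely for clarity, not a full validator.

```python
# {version}-{trace-id}-{parent-id}-{trace-flags}
header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
version, trace_id, parent_id, trace_flags = header.split("-")

assert len(trace_id) == 32    # 16-byte trace id, hex-encoded
assert len(parent_id) == 16   # 8-byte span id, hex-encoded
sampled = int(trace_flags, 16) & 0x01  # bit 0 = sampled flag
```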
Context propagation is the mechanism for passing trace context (trace_id, span_id, baggage) across process/service boundaries. Two operations: Inject - serialize context into carrier (HTTP headers, message metadata). Extract - deserialize context from carrier into current context. Propagation formats: W3C Trace Context (standard), B3 (Zipkin), Jaeger, AWS X-Ray. In OpenTelemetry: TextMapPropagator interface handles injection/extraction. Common carriers: HTTP headers, gRPC metadata, message queue headers (Kafka, RabbitMQ). Automatic vs manual: HTTP clients/servers often auto-propagate with OTel instrumentation. Async operations (queues, scheduled jobs) often require manual propagation. Critical requirement: EVERY service in the call chain must propagate context, or the trace breaks.
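A hedged sketch of manual inject/extract with OpenTelemetry's global propagator (W3C Trace Context by default); the dict carrier stands in for HTTP or message-queue headers.

```python
# Manual context propagation: inject() writes traceparent/tracestate into the
# carrier; extract() reads them back so the consumer continues the same trace.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("worker")

# Producer side: serialize the active context into outgoing headers.
headers = {}
with tracer.start_as_current_span("enqueue_job"):
    inject(headers)            # e.g. attach to a Kafka message or HTTP request

# Consumer side: restore the upstream context before starting the next span.
ctx = extract(headers)
with tracer.start_as_current_span("process_job", context=ctx):
    pass
```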
Head-based sampling: decision made at trace start, propagated to all spans. Types: Always On (100% - dev/low traffic), Probability (e.g., 10% of traces), Rate Limiting (N traces/second). Pros: simple, consistent. Cons: may miss rare errors, decision made without knowing outcome. Tail-based sampling: decision made after trace completes, based on full trace data. Rules: sample all errors, sample slow traces (p99), sample specific attributes. Pros: captures interesting traces, more efficient storage. Cons: requires collector/aggregation point, higher memory (buffer complete traces), added latency. Hybrid approach: head-based probability + tail-based for errors. Production recommendation: 1-10% head sampling for baseline + tail sampling for errors/slow traces. High-traffic systems may sample <1% with tail sampling for anomalies.
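A hedged sketch of head-based sampling configuration in the OpenTelemetry Python SDK; the 10% ratio is illustrative, and tail-based sampling would be configured in the Collector rather than the SDK.

```python
# Head-based sampling: sample ~10% of new traces, but honor the parent's
# decision on incoming requests so traces are never broken mid-flight.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
provider = TracerProvider(sampler=sampler)
```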
The Four Golden Signals (from Google SRE book): 1) Latency - time to service a request. Track success vs error latency separately (errors often fast). Measure p50, p95, p99. 2) Traffic - demand on your system. HTTP requests/second, transactions/second, API calls/minute. Segment by endpoint/operation. 3) Errors - rate of failed requests. HTTP 5xx, application errors, business logic failures. Include partial failures. 4) Saturation - how 'full' your service is. CPU, memory, disk, connection pool utilization. Track against capacity limits. Why these four: If all golden signals are healthy, service is likely healthy. Alerting priority: Error rate and latency first (user-facing impact), then saturation (proactive). RED method (Rate, Errors, Duration) is a simplified version popular for microservices. USE method (Utilization, Saturation, Errors) focuses on infrastructure resources.
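A hedged sketch of recording three of the golden signals per request with the OpenTelemetry metrics API; instrument and attribute names are illustrative, and saturation usually comes from host/runtime metrics rather than per-request code.

```python
# Traffic (counter), errors (counter), and latency (histogram) per request.
from opentelemetry import metrics

meter = metrics.get_meter("api")
requests = meter.create_counter("http.server.requests")
errors = meter.create_counter("http.server.errors")
latency = meter.create_histogram("http.server.duration", unit="ms")

def record(route: str, status: int, duration_ms: float) -> None:
    attrs = {"http.route": route, "http.response.status_code": status}
    requests.add(1, attrs)
    if status >= 500:
        errors.add(1, attrs)
    latency.record(duration_ms, attrs)
```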
Structured logs: log entries in consistent machine-parseable format (JSON, logfmt) with explicit fields. Example: {"timestamp":"2024-01-15T10:30:00Z","level":"error","service":"auth","user_id":"123","message":"login failed","trace_id":"abc123"}. Unstructured logs: free-form text requiring regex parsing. Example: '2024-01-15 10:30:00 ERROR auth - login failed for user 123'. Advantages of structured: 1) Queryable without regex (WHERE user_id='123' AND level='error'). 2) Consistent schema enables dashboards/alerts. 3) Correlation with traces via trace_id field. 4) Efficient indexing in log backends. 5) Field extraction is deterministic, not regex-dependent. Best practices: Always include - timestamp, level, service, trace_id. Context-specific - user_id, request_id, error_code. Avoid - PII, secrets, unbounded fields. Libraries: Python structlog, Node.js pino/winston, Go zerolog/zap.
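A hedged sketch of emitting structured JSON logs with structlog (one of the libraries named above); field names and the bound trace_id are illustrative.

```python
# Structured JSON logs with structlog (pip install structlog).
# Binding trace_id on the logger lets log backends join logs to traces.
import structlog

structlog.configure(processors=[
    structlog.processors.add_log_level,
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.JSONRenderer(),
])

log = structlog.get_logger().bind(service="auth", trace_id="abc123")
log.warning("login_failed", user_id="123", reason="bad_password")
# -> {"service": "auth", "trace_id": "abc123", "user_id": "123",
#     "reason": "bad_password", "event": "login_failed", "level": "warning", "timestamp": "..."}
```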
OpenTelemetry is the de facto industry standard for observability in 2025 (CNCF-backed, supported by all major cloud providers). Core value: vendor-neutral framework collecting traces, metrics, and logs in one unified protocol. Key advantage: instrument once with OpenTelemetry, export to any backend (Grafana, Datadog, New Relic, Prometheus, Jaeger, etc.) vs proprietary agents per vendor. Prevents vendor lock-in while maintaining powerful analytics. Datadog: Comprehensive OpenTelemetry integration, but proprietary agents/SDKs remain the priority; strong visual service maps and dependency mapping. New Relic: Native OpenTelemetry support as core strategy, full specification compliance, better OpenTelemetry documentation; Kubernetes Monitoring with OpenTelemetry in preview (combines collector metrics, events, logs with K8s metadata). SigNoz: Open-source alternative fully native to OpenTelemetry, unified logs/traces/metrics UI, no proprietary components. 2025 adoption: OpenTelemetry is effectively universal across enterprises and has solidified as the cornerstone of telemetry collection. Trade-offs: OpenTelemetry provides future-proofing and flexibility; proprietary tools may offer slightly deeper vendor-specific integrations. Recommendation: OpenTelemetry for new implementations (industry consensus standard), leverage vendor backends for analytics while maintaining data portability.
api_security (8 questions)
Broken Object Level Authorization (BOLA) is the #1 risk. BOLA occurs when an API endpoint exposes object identifiers and fails to verify the requesting user has permission to access the specific object. Attackers manipulate IDs in requests (e.g., changing /api/users/123 to /api/users/124) to access other users' data. BOLA accounts for approximately 40% of all API attacks. Prevention: implement proper authorization checks on every request that accesses a resource using user-supplied IDs, use random/unpredictable GUIDs instead of sequential integers, and never rely solely on client-supplied IDs without server-side authorization validation.
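A hedged, framework-agnostic sketch of an object-level authorization check; the handler shape and the find_order data-access call are hypothetical.

```python
# Object-level authorization: after authenticating the caller, verify that the
# specific object they reference actually belongs to them. Never trust the ID
# in the URL alone.
class Forbidden(Exception):
    pass

def get_order(current_user_id: str, order_id: str, db) -> dict:
    order = db.find_order(order_id)          # hypothetical data-access call
    if order is None or order["owner_id"] != current_user_id:
        # Same response for "missing" and "not yours" avoids leaking existence.
        raise Forbidden("order not accessible")
    return order
```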
The OWASP API Security Top 10 2023: API1 - Broken Object Level Authorization (BOLA): accessing objects without authorization. API2 - Broken Authentication: flawed auth mechanisms. API3 - Broken Object Property Level Authorization: exposing/modifying sensitive object properties. API4 - Unrestricted Resource Consumption: no limits on resources/operations. API5 - Broken Function Level Authorization: accessing admin functions as regular user. API6 - Unrestricted Access to Sensitive Business Flows: automated abuse of business logic. API7 - Server Side Request Forgery (SSRF): fetching remote resources without validation. API8 - Security Misconfiguration: insecure defaults, verbose errors, CORS issues. API9 - Improper Inventory Management: undocumented/deprecated endpoints exposed. API10 - Unsafe Consumption of APIs: trusting third-party API responses without validation.
mTLS (mutual TLS) requires BOTH client and server to present certificates for authentication, unlike standard TLS where only server presents certificate. Process: 1) Server presents cert (standard TLS), 2) Server requests client cert, 3) Client presents cert, 4) Both verify each other's certs against trusted CA. Use cases: service-to-service communication in zero-trust networks, API access from known/controlled clients, microservices mesh (Istio, Linkerd use mTLS by default). Benefits: strong mutual authentication, prevents MITM attacks, no passwords/tokens to manage. Challenges: certificate lifecycle management (rotation, revocation), client certificate distribution, debugging is harder. Not recommended for: public APIs with unknown clients, browser-based applications (no client cert management). Implementation: service meshes handle mTLS automatically, or use API gateways with client cert validation.
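A hedged sketch of requiring client certificates with Python's ssl module; the certificate file paths are placeholders, and in practice a service mesh or gateway typically handles this.

```python
# Server-side mTLS: present a server certificate and require a valid client
# certificate signed by a trusted CA.
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="server.crt", keyfile="server.key")  # server identity
context.load_verify_locations(cafile="clients-ca.crt")                # CA that signs client certs
context.verify_mode = ssl.CERT_REQUIRED                               # reject clients without a valid cert
# Wrap server sockets with this context, or pass it to your HTTP server/framework.
```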
Authentication (AuthN): verifying WHO the user/client is. Answers: 'Are you who you claim to be?' Methods: passwords, tokens, certificates, biometrics. Output: verified identity (user ID, client ID). Authorization (AuthZ): verifying WHAT the authenticated user can do. Answers: 'Are you allowed to perform this action?' Methods: RBAC (role-based), ABAC (attribute-based), ACLs, policy engines. Output: allow/deny decision. Order matters: AuthN MUST happen before AuthZ - you can't authorize an unknown entity. Common protocols: OAuth 2.0 handles authorization (access tokens grant permissions), OpenID Connect adds authentication layer (ID tokens prove identity). API security requires BOTH: AuthN at gateway/edge (validate token signature, check expiration), AuthZ at service level (check permissions for specific resource/action).
Core security functions: 1) Authentication - validate tokens/credentials at edge, reject invalid requests early. 2) Rate limiting - protect backends from abuse and DDoS. 3) Request validation - schema validation, size limits, injection protection. 4) TLS termination - decrypt at gateway, mTLS to backends. 5) IP filtering/geo-blocking for known threats. Best practices: Never expose backend services directly - all traffic through gateway. Validate JWT signatures with cached JWKS (refresh every 24h). Implement defense in depth (gateway + service-level auth). Log all requests with correlation IDs for audit trail. Use WAF rules for OWASP Top 10 protections. Set timeouts and circuit breakers. Strip sensitive headers before forwarding. Popular gateways: Kong, AWS API Gateway, Apigee, Azure API Management, Traefik, NGINX. Security headers to add: X-Content-Type-Options, X-Frame-Options, Content-Security-Policy.
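A hedged sketch of the "validate JWT signatures with cached JWKS" practice using PyJWT's PyJWKClient; the JWKS URL, audience, and issuer are placeholders (pip install "pyjwt[crypto]").

```python
# Verify an RS256-signed JWT against the issuer's JWKS.
import jwt
from jwt import PyJWKClient

jwks_client = PyJWKClient("https://auth.example.com/.well-known/jwks.json")

def verify(token: str) -> dict:
    signing_key = jwks_client.get_signing_key_from_jwt(token)   # selects the key by the token's kid
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],          # pin the algorithm; never accept "none"
        audience="https://api.example.com",
        issuer="https://auth.example.com",
    )
```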
SQL Injection prevention: Use parameterized queries/prepared statements (NEVER string concatenation). ORMs provide this by default. Example: cursor.execute('SELECT * FROM users WHERE id = ?', (user_id,)) not f'SELECT * FROM users WHERE id = {user_id}'. NoSQL Injection prevention: MongoDB - avoid $where with user input, use explicit field matching, validate object structure. Avoid passing raw user objects to queries. Command Injection prevention: Avoid shell execution entirely if possible. Use subprocess with array arguments (not shell=True). Whitelist allowed commands/arguments. General practices: Input validation (type, length, format, range), Output encoding, Least privilege database accounts, WAF rules for injection patterns. Testing: Use SQLMap, NoSQLMap for automated detection. OWASP ZAP for scanning. Static analysis tools (Semgrep, SonarQube) catch injection patterns in code.
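A hedged sketch contrasting a parameterized query with list-style subprocess arguments; sqlite3 placeholder syntax is shown (other drivers use %s or named parameters), and table/column names are illustrative.

```python
import sqlite3
import subprocess

def find_user(conn: sqlite3.Connection, user_id: str):
    # The driver binds user_id as data; it can never become SQL syntax.
    return conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()

def ping(host: str) -> str:
    # Argument list with shell=False (the default): host is a single argv entry,
    # so shell metacharacters like ';' or '&&' are not interpreted.
    result = subprocess.run(["ping", "-c", "1", host], capture_output=True, text=True, timeout=5)
    return result.stdout
```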
Zero Trust architecture: Every API request verified continuously, no assumptions based on session/IP. OAuth 2.1 and OpenID Connect (OIDC) for token-based authentication with MFA. Mutual TLS (mTLS) required for internal service-to-service calls and sensitive APIs (bidirectional certificate validation, zero-trust compliance). Rate limiting and API gateways: Gateways centralize authentication, rate limiting, logging as policy enforcement points. Monitor credential stuffing attempts (26 billion/month industry-wide), use short-lived tokens with automatic revocation/rotation. OWASP API Security Top 10 2023: Validate against risks - #1 threat is Broken Object Level Authorization (BOLA), followed by Broken Authentication and Broken Object Property Level Authorization. Defense-in-depth: Layer encryption, monitoring, input validation, least-privilege authorization. Emerging threats 2025: API attacks increased 220% in 2024, AI-powered credential stuffing (88% of breaches use stolen credentials), schema poisoning attacks, JWT forgery via quantum computing (NIST PQC standards released Aug 2024 - CRYSTALS-Kyber, CRYSTALS-Dilithium, SPHINCS+). Mitigation: Use OWASP validators, contract testing, post-quantum cryptography preparation, strong JWT signing algorithms (ES256 minimum), token expiration policies.
API versioning manages breaking changes while supporting existing clients. Strategies: 1) URL path versioning: /api/v1/users, /api/v2/users. Pros: explicit, easy routing, cache-friendly. Cons: clutters URLs, clients must update endpoints. Most common approach. 2) Query parameter: /api/users?version=2. Pros: single endpoint, optional parameter. Cons: easy to forget, less explicit, caching issues. 3) Header versioning: Accept: application/vnd.api.v2+json or X-API-Version: 2. Pros: clean URLs, content negotiation. Cons: hidden, harder to test in browser, curl requires headers. 4) No versioning (evolve in place): Use additive changes only, never remove/change existing fields. Pros: simple. Cons: eventual cruft, limits evolution. Best practices: Choose ONE strategy consistently. Default to latest stable version if unspecified. Maintain at least N-1 version. Document deprecation timeline (typically 6-12 months). Use semantic versioning for clarity (major.minor.patch).
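A hedged sketch of URL-path versioning (the most common strategy above) using Flask blueprints; routes and response shapes are illustrative.

```python
# Each version gets its own prefix, so /api/v1/users and /api/v2/users can
# evolve and be deprecated independently.
from flask import Flask, Blueprint, jsonify

v1 = Blueprint("v1", __name__, url_prefix="/api/v1")
v2 = Blueprint("v2", __name__, url_prefix="/api/v2")

@v1.get("/users")
def users_v1():
    return jsonify([{"id": 1, "name": "Ada"}])

@v2.get("/users")
def users_v2():
    # v2 adds a field without breaking v1 clients.
    return jsonify([{"id": 1, "name": "Ada", "created_at": "2024-01-15T10:30:00Z"}])

app = Flask(__name__)
app.register_blueprint(v1)
app.register_blueprint(v2)
```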
rate_limiting (3 questions)
Token bucket adds tokens at a fixed rate (e.g., 10 tokens/second) up to a maximum bucket capacity. Each request consumes one token. If bucket is empty, request is rejected or queued. Key characteristics: allows bursts up to bucket capacity (if bucket has 100 tokens, 100 requests can fire immediately), then throttles to the refill rate. Implementation: track tokens (float), last_refill_time. On request: add (current_time - last_refill) * rate tokens (cap at max), subtract 1 if available. Use cases: APIs that need to allow legitimate traffic bursts while preventing sustained abuse. Pros: simple, allows bursts, memory-efficient (one counter per key). Cons: can allow large initial bursts if bucket starts full.
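A minimal in-memory sketch of the token bucket logic described above; a production limiter would typically keep this state in Redis or in the API gateway rather than per-process memory.

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate                    # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, capacity=100)   # 10 req/s sustained, bursts up to 100
```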
Leaky bucket processes requests at a constant rate regardless of input rate, like water leaking from a bucket at a fixed rate. Requests queue in the bucket; if bucket overflows (queue full), requests are dropped. Key characteristics: smooths bursty traffic into steady output, enforces strict rate limit with no bursts allowed. Implementation: FIFO queue with fixed processing rate. Requests enter queue if space available, processed at constant interval. Use cases: network traffic shaping, scenarios requiring smooth/predictable output rate (video streaming, payment processing). Pros: provides very smooth rate, prevents any bursting. Cons: adds latency (queuing), can reject legitimate burst traffic, more complex than token bucket.
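A hedged sketch of a leaky bucket as a bounded queue drained at a fixed rate; the threading model is simplified for illustration.

```python
import queue
import threading
import time
from typing import Callable

class LeakyBucket:
    def __init__(self, rate: float, capacity: int, handler: Callable):
        self.interval = 1.0 / rate           # fixed drain interval
        self.handler = handler
        self.q: queue.Queue = queue.Queue(maxsize=capacity)
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, request) -> bool:
        # Admit the request if the bucket has room; otherwise drop it.
        try:
            self.q.put_nowait(request)
            return True
        except queue.Full:
            return False

    def _drain(self):
        while True:
            self.handler(self.q.get())   # process one queued request
            time.sleep(self.interval)    # enforce the constant output rate
```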
Sliding window combines fixed window simplicity with smoother rate limiting. Two variants: Sliding Window Log - stores timestamp of each request, counts requests in last N seconds by filtering timestamps. Accurate but memory-intensive. Sliding Window Counter - uses two adjacent fixed windows, weights current window fully and previous window by overlap percentage. Example: 70% through current minute, limit 100/min, previous=80 requests, current=30. Weighted count = 30 + (80 * 0.3) = 54. Pros: prevents boundary burst problem of fixed windows, sliding counter is memory-efficient. Cons: sliding log requires storing all timestamps, sliding counter is approximate. Use cases: most production API rate limiters (Cloudflare, AWS API Gateway use variants).
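A hedged single-key sketch of the sliding window counter variant, matching the weighted-count arithmetic in the example above (70% into the minute, previous=80, current=30 gives 30 + 80*0.3 = 54).

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.current_start = self._window_start(time.time())
        self.current_count = 0
        self.previous_count = 0

    def _window_start(self, now: float) -> float:
        return now - (now % self.window)

    def allow(self) -> bool:
        now = time.time()
        start = self._window_start(now)
        if start != self.current_start:
            # Roll windows forward; anything older than one window counts as zero.
            self.previous_count = self.current_count if start - self.current_start == self.window else 0
            self.current_count = 0
            self.current_start = start
        elapsed_fraction = (now - start) / self.window
        # Weight the previous window by the fraction still inside the sliding window.
        weighted = self.current_count + self.previous_count * (1 - elapsed_fraction)
        if weighted < self.limit:
            self.current_count += 1
            return True
        return False
```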
jwt_security (3 questions)
HS256 (HMAC-SHA256): Symmetric algorithm using single shared secret for signing and verification. Same key must exist on all verifying services. Pros: faster, simpler. Cons: secret must be shared with every verifier (security risk), can't have public verification. RS256 (RSA-SHA256): Asymmetric algorithm using private key to sign, public key to verify. Only auth server has private key, anyone can verify with public key. Pros: verifiers don't need secrets, supports JWKS (JSON Web Key Set) for key rotation, better for microservices. Cons: larger tokens, slower signing. Recommendation: Use RS256 for production APIs - allows key rotation via JWKS, verifiers can't forge tokens, works with standard OIDC flows. Use HS256 only for simple single-service scenarios.
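A hedged sketch contrasting HS256 and RS256 with PyJWT (pip install "pyjwt[crypto]"); the secret, key file paths, and claims are placeholders.

```python
import jwt

claims = {"sub": "user-123", "scope": "orders:read"}

# HS256: one shared secret both signs and verifies - every verifier could also forge.
hs_token = jwt.encode(claims, "shared-secret", algorithm="HS256")
jwt.decode(hs_token, "shared-secret", algorithms=["HS256"])

# RS256: the auth server signs with the private key; services verify with the
# public key (usually fetched via JWKS) and cannot mint tokens themselves.
with open("private.pem") as f:
    private_key = f.read()
with open("public.pem") as f:
    public_key = f.read()

rs_token = jwt.encode(claims, private_key, algorithm="RS256")
jwt.decode(rs_token, public_key, algorithms=["RS256"])
```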
Recommended access token expiration: 5-15 minutes for most applications. Short-lived tokens limit damage window if compromised. Specific recommendations by security level: High security (banking, healthcare): 5 minutes. Standard applications: 15 minutes. Low-risk internal tools: up to 1 hour. Never use long-lived access tokens (hours/days) - use refresh token pattern instead. Refresh token expiration: 7-30 days with rotation (new refresh token issued on each use, old one invalidated). Sliding expiration acceptable for refresh tokens. Key principle: access tokens are bearer tokens - anyone with the token can use it. Short expiration + refresh tokens balances security with user experience.
Refresh token rotation: each time a refresh token is used to get new access token, a NEW refresh token is also issued and the old one is invalidated. If an attacker steals a refresh token and the legitimate user also uses it, one will fail (using invalidated token), triggering security alert and full session revocation. Without rotation: stolen refresh token works until expiration (potentially weeks). With rotation: stolen token becomes invalid as soon as legitimate user refreshes. Implementation: store refresh token family/lineage, invalidate entire family if reuse detected. IETF BCP recommends rotation for public clients (SPAs, mobile apps). Additional protection: bind refresh tokens to client with DPoP (Demonstrating Proof-of-Possession) or use sender-constrained tokens.
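A hedged sketch of rotation with family tracking and reuse detection; the in-memory dict and token generation are simplified stand-ins for real persistent storage.

```python
# Each refresh issues a new token in the same "family"; presenting an
# already-rotated token is treated as theft and revokes the whole family.
import secrets

families: dict[str, dict] = {}   # family_id -> {"active": token, "revoked": bool}

def issue_refresh_token(family_id: str | None = None) -> tuple[str, str]:
    family_id = family_id or secrets.token_urlsafe(16)
    token = secrets.token_urlsafe(32)
    families[family_id] = {"active": token, "revoked": False}
    return family_id, token

def rotate(family_id: str, presented: str) -> str | None:
    family = families.get(family_id)
    if family is None or family["revoked"]:
        return None
    if presented != family["active"]:
        # Reuse of an invalidated token: revoke the entire session family.
        family["revoked"] = True
        return None
    _, new_token = issue_refresh_token(family_id)
    return new_token
```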