monitoring_observability 40 Q&As

Monitoring & Observability FAQ & Answers

40 expert monitoring and observability answers researched from official documentation. Every answer cites authoritative sources you can verify.

A

A Counter is a cumulative metric that only increases or resets to zero. Use it for tracking total counts like requests served, errors occurred, or tasks completed. Counter example: http_requests_total{method="GET",endpoint="/api/users"}. Query with rate(http_requests_total[5m]) to get requests per second.
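
A minimal sketch of defining and incrementing such a counter with the Python prometheus_client library (the port and label values are illustrative):

  from prometheus_client import Counter, start_http_server

  HTTP_REQUESTS = Counter(
      "http_requests_total",
      "Total HTTP requests served",
      ["method", "endpoint"],
  )

  start_http_server(8000)  # expose /metrics on port 8000 for Prometheus to scrape
  HTTP_REQUESTS.labels(method="GET", endpoint="/api/users").inc()  # one increment per handled request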

99% confidence
A

A Gauge represents a single numerical value that can increase or decrease over time. Use it for values that fluctuate like memory usage, temperature, queue size, or active connections. Gauge example: memory_usage_bytes{service="auth"}. Query directly as memory_usage_bytes or use avg_over_time(memory_usage_bytes[5m]) for time-averaged values.
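
A minimal sketch with the Python prometheus_client library (values are illustrative); unlike a Counter, a Gauge can be set to an absolute value or moved in either direction:

  from prometheus_client import Gauge

  MEMORY_USAGE = Gauge("memory_usage_bytes", "Resident memory in bytes", ["service"])
  ACTIVE_CONNECTIONS = Gauge("active_connections", "Currently open connections")

  MEMORY_USAGE.labels(service="auth").set(512 * 1024 * 1024)  # set to a measured value
  ACTIVE_CONNECTIONS.inc()  # connection opened
  ACTIVE_CONNECTIONS.dec()  # connection closed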

99% confidence
A

A Histogram samples observations and counts them in configurable buckets. Use for request durations, response sizes, or any measurement distribution. Example with custom buckets: http_request_duration_seconds_bucket{le="0.1"}. Configure buckets: HistogramOpts{Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}}.
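
A minimal sketch of the same bucket layout with the Python prometheus_client library (the observed value is illustrative):

  from prometheus_client import Histogram

  REQUEST_DURATION = Histogram(
      "http_request_duration_seconds",
      "HTTP request latency in seconds",
      buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  )

  REQUEST_DURATION.observe(0.042)  # record a single observation
  with REQUEST_DURATION.time():    # or time a block of work automatically
      pass                         # handle the request here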

99% confidence
A

A Summary calculates quantiles client-side with sliding time windows, providing accurate percentiles per instance. Use when you need exact quantiles for a single service. Example: rpc_duration_seconds{quantile="0.99"}. Unlike Histograms, Summaries cannot be aggregated across multiple instances, making Histograms preferred for distributed systems.
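
A minimal sketch with the Python prometheus_client library. Note that the Python client's Summary exposes only the _count and _sum series; per-instance quantile objectives such as quantile="0.99" are a feature of other clients (for example client_golang), so treat the quantile output above as depending on your client of choice:

  from prometheus_client import Summary

  RPC_DURATION = Summary("rpc_duration_seconds", "RPC latency in seconds")

  RPC_DURATION.observe(0.2)  # record one observation

  @RPC_DURATION.time()       # or decorate a function to time every call
  def handle_rpc():
      pass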

99% confidence
A

The rate() function calculates the average per-second increase of a Counter over a time window. Essential for converting cumulative counters to meaningful rates. Example: rate(http_requests_total{job="api"}[5m]) gives requests per second. Use sum(rate(http_requests_total[5m])) by (instance) to aggregate across endpoints.
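
A minimal sketch of running this query programmatically against the Prometheus HTTP API (/api/v1/query) using the Python requests library; the server URL is illustrative:

  import requests

  resp = requests.get(
      "http://prometheus-server:9090/api/v1/query",
      params={"query": 'rate(http_requests_total{job="api"}[5m])'},
  )
  for series in resp.json()["data"]["result"]:
      print(series["metric"], series["value"])  # value is [timestamp, "<per-second rate>"]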

99% confidence
A

Use histogram_quantile() to calculate percentiles from Histogram bucket counts. Example: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance)) gives 95th percentile latency across instances. The le label contains bucket boundaries, enabling accurate percentile calculations.

99% confidence
A

Use label selectors to filter metrics. Exact match: http_requests_total{method="GET"}. Regex match: http_requests_total{method=~"GET|POST"}. Negative match: http_requests_total{method!="HEAD"}. Chain multiple labels: http_requests_total{job="api",method="GET",code!~"2.."} for non-2xx API responses.

99% confidence
A

The increase() function returns the total increase in a Counter over a time window. Useful for counting events in specific periods. Example: increase(http_requests_total{job="api"}[1h]) shows total requests in the last hour. Combine with sum: sum(increase(http_requests_total[1h])) by (service) for per-service request counts.

99% confidence
A

Add Prometheus data source via Configuration > Data Sources > Add data source. Set URL to http://prometheus-server:9090 (or your Prometheus endpoint). Enable direct connection for internal networks or proxy for external access. Test connection and save. Default scrape interval should match your Prometheus configuration.

99% confidence
A

Create panel, select Prometheus data source, and use PromQL queries. Example latency panel: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). Set unit to seconds, add legend format {{service}}, and configure alert threshold. Use visualization type 'Graph' for time series or 'Stat' for current values.

99% confidence
A

Metrics are numerical measurements collected over time for monitoring system behavior. Implement using counters (request totals), gauges (memory usage), and histograms (response times). Example: track with Prometheus client libraries: Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint']) and increment with .labels(method='GET', endpoint='/api').inc().

99% confidence
A

Logs are timestamped text records of discrete events in your system. Structure logs with JSON format including timestamp, level, message, and context. Example: {'timestamp': '2025-01-15T10:30:00Z', 'level': 'ERROR', 'service': 'auth', 'message': 'Login failed', 'user_id': '123', 'ip': '192.168.1.1'}. Ship to centralized logging with Fluent Bit or Vector.
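
A minimal sketch of emitting logs in that shape with the Python standard library (the formatter below is hand-rolled for illustration; dedicated libraries such as structlog or python-json-logger are common in practice):

  import json
  import logging
  from datetime import datetime, timezone

  class JsonFormatter(logging.Formatter):
      def format(self, record):
          entry = {
              "timestamp": datetime.now(timezone.utc).isoformat(),
              "level": record.levelname,
              "service": "auth",
              "message": record.getMessage(),
          }
          entry.update(getattr(record, "context", {}))  # extra fields such as user_id, ip
          return json.dumps(entry)

  handler = logging.StreamHandler()
  handler.setFormatter(JsonFormatter())
  logger = logging.getLogger("auth")
  logger.addHandler(handler)
  logger.setLevel(logging.INFO)

  logger.error("Login failed", extra={"context": {"user_id": "123", "ip": "192.168.1.1"}})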

99% confidence
A

Distributed traces track requests as they flow through multiple services, showing each operation's timing and dependencies. Implement with OpenTelemetry: initialize tracer, create spans for operations, and propagate context. Example: span = tracer.start_span('database_query'), span.set_attribute('db.statement', 'SELECT * FROM users'), span.end(). Export to Jaeger or Tempo.

99% confidence
A

Install packages: npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-grpc. Create an instrumentation.ts file:

  import { NodeSDK } from '@opentelemetry/sdk-node';
  import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
  import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

  const sdk = new NodeSDK({
    traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4317' }),
    instrumentations: [getNodeAutoInstrumentations()],
    serviceName: 'my-service',
  });
  sdk.start();

Run with the --import flag so the instrumentation loads before application code: node --import ./instrumentation.ts app.js. Spans are then created automatically for HTTP, database, and other instrumented operations. For graceful shutdown, add a SIGTERM handler that calls sdk.shutdown().

99% confidence
A

SLI for availability = successful requests / total requests over time window. Prometheus query: sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100. This gives percentage of successful requests. Label filtering excludes 5xx errors from successful count. Monitor this metric to track your service's reliability against defined SLOs.

99% confidence
A

SLI for latency = percentage of requests faster than threshold. Example for 500ms: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) * 100. Use histogram bucket le="0.5" for 500ms threshold. This gives percentage of requests completing within your latency target, essential for user experience monitoring.

99% confidence
A

Configure alertmanager.yml with route rules to group and direct alerts. Example:

  route:
    group_by: ['alertname', 'cluster']
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 1h
    receiver: 'web.hook'
    routes:
      - match: {severity: 'critical'}
        receiver: 'pagerduty'
      - match: {severity: 'warning'}
        receiver: 'slack'

This groups alerts by name and cluster, sends critical alerts to PagerDuty, and sends warnings to Slack.

99% confidence
A

Set the global scrape interval in prometheus.yml and override it per job:

  global:
    scrape_interval: 15s

  scrape_configs:
    - job_name: 'api'
      scrape_interval: 10s
      static_configs:
        - targets: ['api:8080']
    - job_name: 'database'
      scrape_interval: 30s
      static_configs:
        - targets: ['db-exporter:9100']

Use shorter intervals for fast-changing metrics, longer for stable metrics.

99% confidence
A

Avoid high-cardinality labels like user IDs or request IDs. Use bounded label sets: method, endpoint, status_code. Replace unbounded labels with histogram buckets. Example: instead of http_requests_total{user_id="123"}, use http_requests_total{method="GET",endpoint="/api/users"} and track user activity via separate bounded metrics or logs.

99% confidence
A

Use labels with bounded cardinality for dimensions: service, method, status_code, region. Example: http_requests_total{service="api",method="GET",status_code="200",region="us-east-1"}. Avoid high-cardinality labels like user IDs. Query with aggregation: sum(rate(http_requests_total[5m])) by (service, method) to see request rates per service and method combination.

99% confidence
A

Create YAML files in /etc/grafana/provisioning/dashboards/ and /etc/grafana/provisioning/datasources/. Example dashboard provisioning:

  apiVersion: 1
  providers:
    - name: 'default'
      orgId: 1
      folder: ''
      type: file
      disableDeletion: false
      updateIntervalSeconds: 10
      allowUiUpdates: true
      options:
        path: /etc/grafana/provisioning/dashboards

Place dashboard JSON files in the specified path for automatic loading.

99% confidence
A

Configure otel-collector-config.yaml with receivers, processors, and exporters. Example:

  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: localhost:4317
        http:
          endpoint: localhost:4318

  processors:
    batch:

  exporters:
    prometheus:
      endpoint: "localhost:8889"
    jaeger:
      endpoint: jaeger:14250
      tls:
        insecure: true

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [jaeger]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [prometheus]

From Collector v0.110.0+, the default host is localhost (not 0.0.0.0) for security. Standard ports: 4317 (OTLP gRPC), 4318 (OTLP HTTP). Use 0.0.0.0 only in containerized environments if needed. Note that recent Collector releases removed the dedicated jaeger exporter; with current versions, point an otlp exporter at Jaeger's OTLP endpoint instead.

99% confidence
A

Configure a global Prometheus to scrape aggregated metrics from regional instances. Example:

  scrape_configs:
    - job_name: 'federate'
      honor_labels: true
      metrics_path: /federate
      params:
        'match[]':
          - '{__name__=~"job:.*"}'
          - '{__name__=~"instance:.*"}'
      static_configs:
        - targets: ['prometheus-us-east:9090', 'prometheus-eu-west:9090']
      scrape_interval: 15s

Use honor_labels to preserve original labels and match[] to select specific metrics.

99% confidence
A

Add variables in dashboard settings:

  name: service
  label: Service
  type: Query
  datasource: Prometheus
  query: label_values(http_requests_total, service)
  multi: true
  includeAll: true

Use the variable in panels with the $service placeholder: rate(http_requests_total{service=~"$service"}[5m]). Variables let users filter dashboards by service, region, or other dimensions dynamically.

99% confidence
A

Calculate P99 latency using histogram quantiles: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). Create Grafana panel with this query, set unit to seconds, and add alert threshold. For SLO monitoring: create separate panel for error budget: (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) * 100.

99% confidence
A

Install packages: pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation. Initialize the SDK, get a tracer, and create spans:

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

  provider = TracerProvider()
  processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
  provider.add_span_processor(processor)
  trace.set_tracer_provider(provider)

  tracer = trace.get_tracer(__name__)

  with tracer.start_as_current_span("database_query") as span:
      span.set_attribute("db.system", "postgresql")

Auto-instrumentation: opentelemetry-bootstrap -a install installs instrumentation for detected libraries.

99% confidence
A

Create a recording rules file rules.yml:

  groups:
    - name: api.rules
      rules:
        - record: job:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job)
        - record: job:http_request_duration_seconds:mean5m
          expr: sum(rate(http_request_duration_seconds_sum[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)

Load it in prometheus.yml:

  rule_files:
    - "rules.yml"

Recording rules pre-compute expensive queries for faster dashboard loading.

99% confidence
A

Implement golden signals with Prometheus metrics: Latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])), Traffic: sum(rate(http_requests_total[5m])) by (service), Errors: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service), Saturation: process_resident_memory_bytes / node_memory_MemTotal_bytes. Create Grafana dashboard panels for each signal to monitor system health comprehensively.

99% confidence
A

Create alert rules in alert_rules.yml:

  groups:
    - name: api_alerts
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
        - alert: HighLatency
          expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
          for: 5m

99% confidence
A

Deploy cAdvisor to collect container metrics:

  docker run \
    -v /:/rootfs:ro \
    -v /var/run:/var/run:rw \
    -v /sys:/sys:ro \
    -v /var/lib/docker/:/var/lib/docker:ro \
    -p 8080:8080 \
    gcr.io/cadvisor/cadvisor

Configure a Prometheus scrape job:

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Monitor container CPU with rate(container_cpu_usage_seconds_total{container!="POD"}[5m]) and memory with container_memory_working_set_bytes{container!="POD"}.

99% confidence
A

An error budget is 1 minus the SLO, quantifying acceptable imperfection. For 99.95% uptime SLO over 30 days (43,200 minutes): error budget = 0.05% = 21.6 minutes downtime allowed. Request-based: 99.9% SLO with 1,000,000 requests allows 1,000 errors. Calculate with Prometheus: 1 - (sum(rate(http_requests_total{code!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))). Error budget enables balancing innovation velocity with reliability.
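
A short worked version of that arithmetic (plain Python; the numbers mirror the answer above):

  slo = 0.9995                   # 99.95% availability target
  window_minutes = 30 * 24 * 60  # 30-day window = 43,200 minutes
  print(round((1 - slo) * window_minutes, 1))  # 21.6 minutes of downtime allowed

  requests = 1_000_000
  print(round((1 - 0.999) * requests))         # 1000 errors allowed at a 99.9% SLO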

99% confidence
A

Burn rate measures how fast you consume error budget relative to the SLO. Multi-window, multi-burn-rate alerting (the Google SRE Workbook standard) pairs a long and a short window to detect issues at different speeds. Common configuration: Fast burn: 14.4× rate over 1h (long) + 5m (short), page immediately. Medium burn: 6× rate over 6h (long) + 30m (short), alert within 30 minutes. Slow burn: 3× rate over 24h (long) + 2h (short), ticket the next day. The short window is 1/12 of the long window per Google best practice. Prometheus example for the fast-burn tier of a 99.9% SLO (the threshold is the burn-rate factor times the allowed error ratio, i.e. 14.4 * 0.001):

  expr: >
    (1 - sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) > (14.4 * 0.001)
    and
    (1 - sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
  for: 2m

The multi-window approach filters noise while catching real issues.
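
The multipliers above follow from dividing the fraction of budget consumed by the fraction of the SLO window elapsed; a short worked calculation in plain Python (the 2%/5%/10% budget fractions are illustrative, assuming a 30-day window):

  window_hours = 30 * 24  # 720 hours in a 30-day SLO window

  print(round(0.02 * window_hours / 1, 1))   # 14.4x -> 2% of budget burned in 1 hour
  print(round(0.05 * window_hours / 6, 1))   # 6.0x  -> 5% of budget burned in 6 hours
  print(round(0.10 * window_hours / 24, 1))  # 3.0x  -> 10% of budget burned in 24 hours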

99% confidence
A

RED method monitors request-driven microservices using three metrics: Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution). Created by Tom Wilkie, derived from Google's Four Golden Signals. Implement with Prometheus: Rate: sum(rate(http_requests_total[5m])) by (service), Errors: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service), Duration: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])). Standardizes monitoring across all microservices.

99% confidence
A

The USE method by Brendan Gregg checks system health by examining Utilization (% of time the resource is busy), Saturation (degree of queued work), and Errors (count of error events) for every resource. Apply it to CPUs, memory, network interfaces, and storage devices. Example: CPU utilization: 85% busy; CPU saturation: 3 processes queued; network interface errors: 50 late collisions. It provides a complete system health check and identifies bottlenecks quickly. Use it with the dedicated Linux/Solaris checklists at brendangregg.com/usemethod.html.

99% confidence
A

Prevent alert fatigue with SLO-based alerting, runbook automation, and AI-powered correlation. Best practices: Every alert must be actionable with linked runbooks. Use multi-burn-rate alerts instead of threshold alerts. Implement AIOps event correlation to reduce 1,000 raw events to one actionable alert. Auto-remediation executes runbooks without manual intervention. Regular alert reviews in retrospectives. 2025 case study: 91% alert volume reduction with 4x faster incident resolution using automated runbooks and event deduplication.

99% confidence
A

W3C Trace Context propagates trace information via HTTP headers: traceparent (trace ID, span ID, sampling flags) and tracestate (vendor-specific context). OpenTelemetry uses W3C Trace Context by default. Context propagation enables distributed tracing by correlating spans across services. Auto-instrumentation handles propagation automatically for HTTP/gRPC. Manual implementation: extract context from incoming request headers, pass to downstream services. Without propagation, traces fragment and lose request journey visibility.
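
A minimal sketch of manual propagation with the OpenTelemetry Python API (assumes a TracerProvider is already configured as in the Python SDK answer above; the downstream URL is illustrative). inject() writes traceparent/tracestate into a carrier dict and extract() rebuilds the context from incoming headers:

  from opentelemetry import trace
  from opentelemetry.propagate import inject, extract

  tracer = trace.get_tracer(__name__)

  # Client side: copy the active span's context into outgoing HTTP headers.
  with tracer.start_as_current_span("outbound_call"):
      headers = {}
      inject(headers)  # adds "traceparent" (and "tracestate" when present)
      # requests.get("http://downstream-service/api", headers=headers)

  # Server side: continue the trace from the incoming request's headers.
  incoming_headers = headers  # e.g. dict(request.headers) in a web framework
  ctx = extract(incoming_headers)
  with tracer.start_as_current_span("handle_request", context=ctx):
      pass  # this span is now correlated with the caller's trace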

99% confidence
A

Head-based sampling makes sampling decisions when the root span begins, before trace completion. Most common: Consistent Probability Sampling (sample 10% of traces). Advantages: low overhead, simple implementation. Disadvantages: cannot sample based on errors or latency since those occur after sampling decision. Example: sampler: TraceIdRatioBased(0.1) samples 10% of traces. Use for high-volume services where random sampling suffices. Combine with tail-based sampling for sophisticated decisions downstream.
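
A minimal sketch of configuring this sampler with the OpenTelemetry Python SDK (the 10% ratio mirrors the example above); wrapping it in ParentBased keeps child spans consistent with the caller's sampling decision:

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

  # Sample ~10% of new traces; honor the parent's sampled flag for downstream spans.
  provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
  trace.set_tracer_provider(provider)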

99% confidence
A

Tail-based sampling makes decisions after all spans of a trace complete, enabling intelligent sampling based on errors, latency, or specific attributes. It is implemented via the OpenTelemetry Collector tail sampling processor. Example policies: sample all traces with errors, sample traces over 1 second of latency, sample 1% of successful fast traces. It requires all spans of a trace to reach the same collector instance. Configure:

  processors:
    tail_sampling:
      policies:
        - name: error-traces
          type: status_code
          status_code:
            status_codes: [ERROR]

Ideal for production debugging, but at a higher resource cost than head sampling.

99% confidence