Monitoring & Observability FAQ & Answers
40 expert Monitoring & Observability answers researched from official documentation. Every answer cites authoritative sources you can verify.
A Counter is a cumulative metric that only increases or resets to zero. Use it for tracking total counts like requests served, errors occurred, or tasks completed. Counter example: http_requests_total{method="GET",endpoint="/api/users"}. Query with rate(http_requests_total[5m]) to get requests per second.
A Gauge represents a single numerical value that can increase or decrease over time. Use it for values that fluctuate, like memory usage, temperature, queue size, or active connections. Gauge example: memory_usage_bytes{service="auth"}. Query directly as memory_usage_bytes or use avg_over_time(memory_usage_bytes[5m]) for time-averaged values.
A Histogram samples observations and counts them in configurable buckets. Use for request durations, response sizes, or any measurement distribution. Example with custom buckets: http_request_duration_seconds_bucket{le="0.1"}. Configure buckets: HistogramOpts{Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}}.
A Summary calculates quantiles client-side with sliding time windows, providing accurate percentiles per instance. Use when you need exact quantiles for a single service. Example: rpc_duration_seconds{quantile="0.99"}. Unlike Histograms, Summaries cannot be aggregated across multiple instances, making Histograms preferred for distributed systems.
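For reference, a minimal Python sketch (assuming the official prometheus_client library; metric names and label values here are illustrative) that registers all four metric types and exposes them for scraping:

import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: cumulative total that only goes up (resets to zero on process restart)
REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])

# Gauge: a current value that can rise and fall
IN_PROGRESS = Gauge('inprogress_requests', 'Requests currently being handled')

# Histogram: observations counted into buckets (exposes _bucket, _sum and _count series)
LATENCY = Histogram('http_request_duration_seconds', 'Request latency in seconds',
                    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10])

# Summary: per-instance _count and _sum (the Python client does not compute quantiles)
RPC_LATENCY = Summary('rpc_duration_seconds', 'RPC latency in seconds')

if __name__ == '__main__':
    start_http_server(8000)   # serve /metrics on :8000 for Prometheus to scrape
    REQUESTS.labels(method='GET', endpoint='/api/users').inc()
    IN_PROGRESS.set(3)
    LATENCY.observe(0.042)
    RPC_LATENCY.observe(0.015)
    while True:
        time.sleep(1)          # keep the process alive so the endpoint stays up

Prometheus then scrapes http://localhost:8000/metrics on its configured interval.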
The rate() function calculates the average per-second increase of a Counter over a time window. Essential for converting cumulative counters to meaningful rates. Example: rate(http_requests_total{job="api"}[5m]) gives requests per second. Use sum(rate(http_requests_total[5m])) by (instance) to aggregate across endpoints.
The irate() function calculates the instantaneous rate per second using the last two data points in a time window. Ideal for volatile counters that change rapidly. Example: irate(cpu_usage_total[2m]) reacts faster to changes than rate(). Use smaller time windows (1-2 minutes) for best results.
Use histogram_quantile() to calculate percentiles from Histogram bucket counts. Example: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance)) gives 95th percentile latency across instances. The le label contains bucket boundaries, enabling accurate percentile calculations.
The 'le' (less than or equal) label defines bucket upper bounds. Configure with exponential buckets for web latency: Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}. Query with sum(rate(http_request_duration_seconds_bucket[5m])) by (le) to see request distribution across buckets.
Use label selectors to filter metrics. Exact match: http_requests_total{method="GET"}. Regex match: http_requests_total{method=~"GET|POST"}. Negative match: http_requests_total{method!="HEAD"}. Chain multiple labels: http_requests_total{job="api",method="GET",code!~"2.."} for non-2xx API responses.
The increase() function returns the total increase in a Counter over a time window. Useful for counting events in specific periods. Example: increase(http_requests_total{job="api"}[1h]) shows total requests in the last hour. Combine with sum: sum(increase(http_requests_total[1h])) by (service) for per-service request counts.
Add Prometheus data source via Configuration > Data Sources > Add data source. Set URL to http://prometheus-server:9090 (or your Prometheus endpoint). Enable direct connection for internal networks or proxy for external access. Test connection and save. Default scrape interval should match your Prometheus configuration.
Create a panel, select the Prometheus data source, and use PromQL queries. Example latency panel: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). Set the unit to seconds, add legend format {{service}}, and configure an alert threshold. Use the 'Time series' visualization (formerly 'Graph') for trends or 'Stat' for current values.
Metrics are numerical measurements collected over time for monitoring system behavior. Implement using counters (request totals), gauges (memory usage), and histograms (response times). Example: track with Prometheus client libraries: Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint']) and increment with .labels(method='GET', endpoint='/api').inc().
Logs are timestamped text records of discrete events in your system. Structure logs with JSON format including timestamp, level, message, and context. Example: {'timestamp': '2025-01-15T10:30:00Z', 'level': 'ERROR', 'service': 'auth', 'message': 'Login failed', 'user_id': '123', 'ip': '192.168.1.1'}. Ship to centralized logging with Fluent Bit or Vector.
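A minimal sketch of structured JSON logging using only the Python standard library (the field names mirror the example above; the 'service' value and logger name are illustrative):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'level': record.levelname,
            'service': 'auth',              # illustrative static field
            'message': record.getMessage(),
        }
        # Merge structured context passed via logging's `extra=` argument
        entry.update(getattr(record, 'context', {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger('auth')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error('Login failed', extra={'context': {'user_id': '123', 'ip': '192.168.1.1'}})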
Distributed traces track requests as they flow through multiple services, showing each operation's timing and dependencies. Implement with OpenTelemetry: initialize tracer, create spans for operations, and propagate context. Example: span = tracer.start_span('database_query'), span.set_attribute('db.statement', 'SELECT * FROM users'), span.end(). Export to Jaeger or Tempo.
Install packages: npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-grpc. Create an instrumentation.ts file:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const sdk = new NodeSDK({
  serviceName: 'my-service',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Run with the --import flag so the instrumentation loads before application code: node --import ./instrumentation.js app.js (compile the TypeScript first, or run it through a TypeScript-aware loader such as ts-node). Spans are created automatically for HTTP, database, and other supported operations. For graceful shutdown, add a SIGTERM handler that calls sdk.shutdown().
SLI for availability = successful requests / total requests over time window. Prometheus query: sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100. This gives percentage of successful requests. Label filtering excludes 5xx errors from successful count. Monitor this metric to track your service's reliability against defined SLOs.
SLI for latency = percentage of requests faster than threshold. Example for 500ms: sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) * 100. Use histogram bucket le="0.5" for 500ms threshold. This gives percentage of requests completing within your latency target, essential for user experience monitoring.
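As a worked example (with illustrative request counts), the same two SLI formulas in plain Python:

# Availability SLI: successful (non-5xx) requests / total requests
total_requests = 1_000_000
server_errors = 500
availability_sli = (total_requests - server_errors) / total_requests * 100   # 99.95%

# Latency SLI: requests completing within the 500 ms threshold / total requests
requests_under_500ms = 985_000
latency_sli = requests_under_500ms / total_requests * 100                    # 98.5%

print(f"availability: {availability_sli:.2f}%, latency: {latency_sli:.1f}%")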
Configure alertmanager.yml with route rules to group and direct alerts. Example:

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match: {severity: 'critical'}
      receiver: 'pagerduty'
    - match: {severity: 'warning'}
      receiver: 'slack'

This groups alerts by name and cluster, sends critical alerts to PagerDuty, and warnings to Slack.
Set the global scrape interval in prometheus.yml and override it per job:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'api'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
  - job_name: 'database'
    scrape_interval: 30s
    static_configs:
      - targets: ['db-exporter:9100']

Use shorter intervals for fast-changing metrics, longer for stable metrics.
Avoid high-cardinality labels like user IDs or request IDs. Use bounded label sets: method, endpoint, status_code. Replace unbounded labels with histogram buckets. Example: instead of http_requests_total{user_id="123"}, use http_requests_total{method="GET",endpoint="/api/users"} and track user activity via separate bounded metrics or logs.
Use labels with bounded cardinality for dimensions: service, method, status_code, region. Example: http_requests_total{service="api",method="GET",status_code="200",region="us-east-1"}. Avoid high-cardinality labels like user IDs. Query with aggregation: sum(rate(http_requests_total[5m])) by (service, method) to see request rates per service and method combination.
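A short Python sketch (using prometheus_client; names are illustrative) contrasting bounded and unbounded label sets:

from prometheus_client import Counter

# Good: bounded label values (a handful of services, methods, status codes, regions)
HTTP_REQUESTS = Counter('http_requests_total', 'Total HTTP requests',
                        ['service', 'method', 'status_code', 'region'])
HTTP_REQUESTS.labels(service='api', method='GET',
                     status_code='200', region='us-east-1').inc()

# Bad: user_id is unbounded -- every new user creates a new time series,
# inflating Prometheus memory use and slowing queries. Track per-user detail
# in logs or traces instead.
# BAD_REQUESTS = Counter('http_requests_by_user_total', 'Requests', ['user_id'])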
Create YAML files in /etc/grafana/provisioning/dashboards/ and /etc/grafana/provisioning/datasources/. Example dashboard provisioning:

apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards

Place dashboard JSON files in the specified path for automatic loading.
Configure otel-collector-config.yaml with receivers, processors, and exporters. Example:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318
processors:
  batch:
exporters:
  prometheus:
    endpoint: "localhost:8889"
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

From Collector v0.110.0+, the default host is localhost (not 0.0.0.0) for security. Standard ports: 4317 (OTLP gRPC), 4318 (OTLP HTTP). Use 0.0.0.0 only in containerized environments if needed.
Configure a global Prometheus to scrape aggregated metrics from regional instances. Example:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - '{__name__=~"instance:.*"}'
    static_configs:
      - targets: ['prometheus-us-east:9090', 'prometheus-eu-west:9090']
    scrape_interval: 15s

Use honor_labels to preserve original labels and match[] to select specific metrics.
Add variables in dashboard settings: name: service label: Service type: Query datasource: Prometheus query: label_values(http_requests_total, service) multi: true includeAll: true. Use variable in panels with $service placeholder: rate(http_requests_total{service=~"$service"}[5m]). Variables enable users to filter dashboards by service, region, or other dimensions dynamically.
Calculate P99 latency using histogram quantiles: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)). Create a Grafana panel with this query, set the unit to seconds, and add an alert threshold. For SLO monitoring, add a separate panel for the error rate that consumes your error budget: (1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))) * 100.
Install packages: pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation. Initialize the SDK:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("database_query") as span:
    span.set_attribute("db.system", "postgresql")

Auto-instrumentation: opentelemetry-bootstrap -a install installs instrumentation for detected libraries.
Create a recording rules file rules.yml:

groups:
  - name: api.rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:mean5m
        expr: sum(rate(http_request_duration_seconds_sum[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job)

Load it in prometheus.yml:

rule_files:
  - "rules.yml"

Recording rules pre-compute expensive queries for faster dashboard loading.
Implement golden signals with Prometheus metrics: Latency: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])), Traffic: sum(rate(http_requests_total[5m])) by (service), Errors: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service), Saturation: process_resident_memory_bytes / node_memory_MemTotal_bytes. Create Grafana dashboard panels for each signal to monitor system health comprehensively.
Create alert rules in alert_rules.yml:

groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
Deploy cAdvisor to collect container metrics: docker run -v /:/rootfs:ro -v /var/run:/var/run:rw -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro -p 8080:8080 gcr.io/cadvisor/cadvisor. Configure Prometheus scrape: - job_name: 'cadvisor' static_configs: - targets: ['cadvisor:8080']. Monitor container CPU: rate(container_cpu_usage_seconds_total{container!="POD"}[5m]) and memory: container_memory_working_set_bytes{container!="POD"}.
An error budget is 1 minus the SLO, quantifying acceptable imperfection. For a 99.95% uptime SLO over 30 days (43,200 minutes): error budget = 0.05% = 21.6 minutes of downtime allowed. Request-based: a 99.9% SLO with 1,000,000 requests allows 1,000 errors. Track consumption with Prometheus by measuring the actual error ratio over the window: 1 - (sum(rate(http_requests_total{code!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) and comparing it against the budget. Error budgets enable balancing innovation velocity with reliability.
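The same arithmetic as a small Python sketch:

slo = 0.9995                      # 99.95% availability target
window_minutes = 30 * 24 * 60     # 30-day window = 43,200 minutes

error_budget_fraction = 1 - slo                        # 0.0005 = 0.05%
downtime_budget = window_minutes * error_budget_fraction
print(downtime_budget)            # 21.6 minutes of allowed downtime

# Request-based: 99.9% SLO over 1,000,000 requests
allowed_errors = (1 - 0.999) * 1_000_000
print(allowed_errors)             # 1000.0 allowed failed requests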
Burn rate measures how fast you consume error budget relative to the SLO. Multi-window, multi-burn-rate alerting (the Google SRE Workbook standard) uses paired long/short windows to detect issues at different speeds. Common configuration: Fast burn: 14.4× rate over 1h (long) + 5m (short), page immediately. Medium burn: 6× rate over 6h (long) + 30m (short), alert within 30 minutes. Slow burn: 3× rate over 24h (long) + 2h (short), ticket the next day. The short window is 1/12 of the long window per Google best practice. Prometheus example (for a 99.9% SLO, so the threshold is the burn rate times the 0.1% error budget): expr: (1 - sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) > (14.4 * 0.001) and (1 - sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001) for: 2m. The multi-window approach filters noise while still catching real issues.
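To see what those multipliers mean in practice, a small sketch (following the thresholds above) of how quickly each burn rate exhausts a 30-day error budget:

window_days = 30

# burn rate = observed error rate / error rate allowed by the SLO
for name, burn_rate in [('fast', 14.4), ('medium', 6), ('slow', 3)]:
    days_to_exhaust = window_days / burn_rate
    print(f"{name}: {burn_rate}x burn exhausts the 30-day budget in {days_to_exhaust:.1f} days")

# fast:   14.4x -> ~2.1 days (page immediately)
# medium: 6x    -> 5.0 days  (alert within 30 minutes)
# slow:   3x    -> 10.0 days (ticket for the next business day)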
RED method monitors request-driven microservices using three metrics: Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution). Created by Tom Wilkie, derived from Google's Four Golden Signals. Implement with Prometheus: Rate: sum(rate(http_requests_total[5m])) by (service), Errors: sum(rate(http_requests_total{code=~"5.."}[5m])) by (service), Duration: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])). Standardizes monitoring across all microservices.
The USE method by Brendan Gregg checks system health by examining Utilization (% of time the resource is busy), Saturation (degree of queued work), and Errors (count of error events) for every resource. Apply it to CPUs, memory, network interfaces, and storage devices. Example: CPU utilization: 85% busy; CPU saturation: 3 processes queued; network interface errors: 50 late collisions. It provides a complete system health check and identifies bottlenecks quickly. Use it with the dedicated Linux/Solaris checklists at brendangregg.com/usemethod.html.
Prevent alert fatigue with SLO-based alerting, runbook automation, and AI-powered correlation. Best practices: Every alert must be actionable with linked runbooks. Use multi-burn-rate alerts instead of threshold alerts. Implement AIOps event correlation to reduce 1,000 raw events to one actionable alert. Auto-remediation executes runbooks without manual intervention. Regular alert reviews in retrospectives. 2025 case study: 91% alert volume reduction with 4x faster incident resolution using automated runbooks and event deduplication.
W3C Trace Context propagates trace information via HTTP headers: traceparent (trace ID, span ID, sampling flags) and tracestate (vendor-specific context). OpenTelemetry uses W3C Trace Context by default. Context propagation enables distributed tracing by correlating spans across services. Auto-instrumentation handles propagation automatically for HTTP/gRPC. Manual implementation: extract context from incoming request headers, pass to downstream services. Without propagation, traces fragment and lose request journey visibility.
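A minimal sketch of manual W3C Trace Context propagation with the OpenTelemetry Python SDK (assumes a tracer provider is already configured as in the Python setup answer above; the downstream URL and requests usage are illustrative):

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Outgoing call: inject the current context into HTTP headers (adds traceparent/tracestate)
def call_downstream():
    headers = {}
    inject(headers)
    return requests.get('http://downstream:8080/work', headers=headers)

# Incoming request: extract the parent context from headers and continue the trace
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span('handle_request', context=ctx) as span:
        span.set_attribute('http.route', '/work')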
Head-based sampling makes sampling decisions when the root span begins, before trace completion. Most common: Consistent Probability Sampling (sample 10% of traces). Advantages: low overhead, simple implementation. Disadvantages: cannot sample based on errors or latency since those occur after sampling decision. Example: sampler: TraceIdRatioBased(0.1) samples 10% of traces. Use for high-volume services where random sampling suffices. Combine with tail-based sampling for sophisticated decisions downstream.
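A minimal Python sketch (OpenTelemetry SDK; exporter setup omitted) configuring 10% head-based sampling, wrapped in ParentBased so child spans follow the root decision:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of traces, decided once at the root span from the trace ID
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span('checkout') as span:
    # For the ~90% of traces that are dropped, the span is non-recording
    # and attribute writes become no-ops.
    span.set_attribute('sampled.example', True)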
Tail-based sampling makes decisions after all spans of a trace complete, enabling intelligent sampling based on errors, latency, or specific attributes. It is implemented via the OpenTelemetry Collector tail sampling processor. Example policies: sample all traces with errors, sample traces slower than 1 second, sample 1% of successful fast traces. It requires that all spans of a trace reach the same collector instance. Configure:

processors:
  tail_sampling:
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]

Ideal for production debugging, but at a higher resource cost than head sampling.