Bucket aggregations group documents into buckets based on field values, ranges, or intervals for analytics and faceting. Common use cases: (1) Faceted search / filters: E-commerce: Group products by category, brand, price ranges. Example: terms aggregation on "category" field shows "Electronics (1,234)", "Clothing (567)". Users click facets to filter results. (2) Time-series analysis: Date histogram groups events into time buckets (hourly, daily, monthly). Example: Analyze log volume over time, sales trends per day, website traffic patterns. Use case: "Show me error count per hour for last 24 hours". (3) Category analytics: Terms aggregation finds top N values for a field. Example: "Top 10 bestselling products", "Most active users", "Popular search terms". Supports ordering by count or nested metric (top products by revenue). (4) Numeric distributions: Histogram groups numeric values into ranges. Example: Age distribution (0-10, 10-20, 20-30 years), price distribution ($0-$50, $50-$100). Use case: "Show me user age demographics". (5) Custom ranges: Range aggregation creates arbitrary buckets. Example: Price tiers (budget: $0-$100, mid: $100-$500, premium: $500+), performance tiers (fast: 0-100ms, medium: 100-500ms, slow: 500ms+). (6) Geo analytics: Geohash grid (geohash_grid) groups documents by geographic cell. Example: "Heatmap of user locations", "Sales by region". Common bucket types: terms (field values), date_histogram (time intervals), histogram (numeric intervals), range (custom ranges), filters (multiple filter criteria), nested (nested documents), geo_distance (geographic radius), significant_terms (unusual terms). Nesting: Bucket aggregations can be nested: terms > date_histogram > avg to show "average sales per day for each product category". Best practice (2025): Limit terms aggregation cardinality (use size parameter) to prevent memory issues. Use composite aggregation for pagination through large bucket sets.
Combine with metric aggregations (avg, sum, max) for deeper analytics.
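The nesting pattern described above (terms > date_histogram > avg) can be sketched as a request body. A minimal sketch, assuming illustrative field names ("category.keyword", "@timestamp", "amount") that would vary per index:

```python
# Nested bucket aggregation: terms > date_histogram > avg.
# Field names are illustrative, not from a real index.
nested_aggs = {
    "size": 0,  # skip document hits, return only aggregation results
    "aggs": {
        "by_category": {
            "terms": {"field": "category.keyword", "size": 10},
            "aggs": {
                "per_day": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "calendar_interval": "day",
                    },
                    "aggs": {
                        # metric sub-aggregation per (category, day) bucket
                        "avg_sale": {"avg": {"field": "amount"}}
                    },
                }
            },
        }
    },
}
```

Setting "size": 0 is the usual companion to pure-analytics requests, since the hit list isn't needed.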
Search Analytics FAQ & Answers
50 expert Search Analytics answers researched from official documentation. Every answer cites authoritative sources you can verify.
Match query analyzes input text using the field's analyzer (tokenization, lowercasing, stemming, synonyms) before searching the inverted index. Example: "Running Shoes" becomes "run shoe" tokens. Use for full-text search on text fields. Term query performs exact match with zero analysis, searching the literal string in the inverted index. Example: "ACTIVE" only matches exact "ACTIVE" (case-sensitive). Use for keyword fields (IDs, statuses, tags). Performance: Term query 20-40% faster due to skipping analysis overhead (term ~2ms vs match ~3ms). Critical mistake: Using term on text fields fails because text fields are analyzed during indexing. Query "Running shoes" won't match indexed "run shoe" tokens. Best practice (2025): Use match for text field types, term for keyword field types. For performance-critical exact matching, use keyword fields with term queries. Configure field mapping: {"status": {"type": "keyword"}} for exact matching, {"description": {"type": "text"}} for full-text search.
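The mapping and query pairing above can be written out as request bodies. A sketch using the field names from the answer ("status", "description"):

```python
# Field mappings and the matching query type for each field, mirroring
# the best practice above: keyword -> term, text -> match.
mapping = {
    "mappings": {
        "properties": {
            "status": {"type": "keyword"},     # exact matching, use term
            "description": {"type": "text"},   # analyzed, use match
        }
    }
}

# term: literal lookup, no analysis, case-sensitive
term_query = {"query": {"term": {"status": {"value": "ACTIVE"}}}}

# match: query string is analyzed with the field's analyzer first
match_query = {"query": {"match": {"description": "Running Shoes"}}}
```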
Architecture hierarchy: Elasticsearch index contains multiple shards. Each shard is a complete Apache Lucene index. Each Lucene index contains multiple immutable segments (mini-indexes). Each segment contains inverted index structure mapping terms to document IDs. Inverted index: Processes documents to extract unique terms/tokens, records which documents contain each term. Example: Term "docker" → [doc1, doc5, doc12]. Search workflow: Query searches each segment sequentially within shard, combines results. Segments structure: Contains inverted index (term→document mappings), stored fields (original JSON), doc values (columnar data for sorting/aggregations), norms (field length for BM25 scoring). Immutability benefits: Concurrent reads without locks, aggressive OS filesystem caching (30-50% speedup), better compression (40-60%). Segment lifecycle: New documents create new segments, updates/deletes marked in .liv files (not modified in-place), background merging combines small segments into larger ones while removing deleted docs. 2025 enhancements: Elasticsearch 8.x integrates latest Lucene advancements including enhanced I/O parallelism and specialized HNSW graph merging for vector search.
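The term-to-document mapping at the heart of a segment can be illustrated with a toy inverted index. A minimal sketch; real Lucene segments additionally store frequencies, positions, and compressed posting lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: lowercased whitespace tokens -> sorted doc IDs.
    Mirrors the example above: "docker" -> [doc1, doc5, doc12]."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Docker images", 5: "docker compose", 12: "Docker swarm"}
index = build_inverted_index(docs)
# index["docker"] -> [1, 5, 12]
```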
License change trigger: On January 21, 2021, Elastic NV changed Elasticsearch/Kibana licensing from permissive Apache License 2.0 to dual licensing under SSPL (Server Side Public License) and Elastic License v2. Neither SSPL nor ELv2 are OSI-approved open-source licenses. Elastic's goal: Prevent cloud providers (primarily AWS) from offering "Elasticsearch as a service" without contributing back to Elastic commercially. AWS response: April 2021 - Forked last Apache 2.0 version (Elasticsearch 7.10.2, Kibana 7.10.2) to create OpenSearch and OpenSearch Dashboards. Removed all Elastic proprietary code (X-Pack commercial features including security, ML, monitoring). First GA release: OpenSearch 1.0 in July 2021, just 3 months after fork announcement. License preservation: OpenSearch remains 100% Apache 2.0 with no proprietary elements, ensuring vendor neutrality. 2024-2025 developments: September 2024 - AWS transferred OpenSearch governance to Linux Foundation, establishing OpenSearch Foundation for vendor-neutral stewardship. Elastic added AGPLv3 license option in September 2024, making Elasticsearch officially open-source again alongside SSPL/ELv2 options. Current status (2025): Both projects are open-source - OpenSearch under Apache 2.0, Elasticsearch under AGPLv3/SSPL/ELv2 tri-licensing.
In September 2024, Elastic added Open Source Initiative (OSI) approved AGPLv3 license as option alongside SSPL and Elastic License v2 (ELv2). This made Elasticsearch officially open source again after 2021's controversial move to SSPL (source-available, not open source). Background: In 2021, Elastic changed from Apache 2.0 to dual licensing under SSPL 1.0 and ELv2 with 7.11 release. This 2021 change caused AWS to fork Elasticsearch 7.10.2 to create OpenSearch. With AGPLv3 addition in 2024, source code now available under three licenses: SSPL 1.0, AGPLv3, and ELv2 - giving users choice. Significance (2025): AGPLv3 enables customers and community to use, modify, redistribute, and collaborate on Elastic's source code under well-known open-source license. Addition doesn't affect existing SSPL or ELv2 users, no change to binary distributions. This licensing flexibility allows Elastic to regain "open source" classification while offering proprietary options. Critical difference: AGPLv3 is OSI-approved open source, SSPL is not (considered source-available). AGPLv3 copyleft provision requires source code disclosure if modified software used as network service. Impact: Addressing community trust concerns from 2021 license change.
Use Algolia for: (1) E-commerce and content discovery where search is core feature requiring instant results (1-20ms query latency at scale). (2) Rapid deployment needs - fully managed SaaS eliminates infrastructure overhead, setup takes minutes vs days. (3) Typo tolerance and relevance tuning out-of-box (Damerau-Levenshtein algorithm). (4) Limited technical resources - minimal DevOps requirements. Pricing: Record-based (1M records/month ~$1,500+), predictable scaling. Use Elasticsearch for: (1) Complex analytics beyond search - log analysis, APM, SIEM (Elastic Stack ecosystem). (2) Deep customization - modify scoring algorithms, build custom analyzers, plugins. (3) Cost-sensitive large-scale deployments - self-hosted Elasticsearch significantly cheaper at 100M+ documents (compute-based pricing vs record-based). (4) Strong technical teams comfortable managing infrastructure. (5) Full-text search with aggregations, filtering across high-cardinality fields. Performance: Both achieve <50ms with proper tuning. Algolia's edge: managed infrastructure, global CDN. Elasticsearch's edge: on-premises control, unlimited customization. Critical difference (2025): Algolia = speed + convenience premium, Elasticsearch = flexibility + cost efficiency at scale. Decision point: If search budget >$5K/month and team has DevOps capacity, evaluate Elasticsearch self-hosted. If search is mission-critical but team is small, Algolia's managed service reduces operational risk.
1. Bucket aggregations: Group documents into buckets based on field values, ranges, or criteria. Examples: terms (top N field values like product categories), histogram (numeric ranges), date_histogram (time intervals), range (custom buckets), filters (multiple filter criteria). Use cases: Faceted search filters, time-series analysis, category analytics. Example: {"terms": {"field": "category.keyword", "size": 10}} groups products by category. 2. Metric aggregations: Calculate statistics from field values - mathematical operations like COUNT, SUM, MIN, MAX, AVERAGE, CARDINALITY. Can be top-level or sub-aggregations within buckets. Examples: avg, sum, min, max, stats, percentiles, cardinality. Use case: Calculate average price per category. Example: {"avg": {"field": "price"}} calculates average price. 3. Pipeline aggregations: Take input from OTHER aggregations (not documents/fields), enabling chaining and transformations. Two families: Parent (add data to existing buckets - derivative, cumulative_sum, moving_avg) and Sibling (create new metric from sibling buckets - min_bucket, max_bucket, avg_bucket). Use case: Calculate month-over-month growth rate using derivative on date_histogram. Example: {"derivative": {"buckets_path": "sales>total"}} calculates change between consecutive buckets. Best practice (2025): Combine all three types - bucket to group, metric to calculate per bucket, pipeline to analyze trends across buckets.
Parent and sibling pipeline aggregations differ fundamentally in structural positioning and how they reference other aggregations. Parent pipeline aggregations are NESTED INSIDE their parent multi-bucket aggregation, adding computed metrics directly to each bucket's output. They operate on parent aggregation metrics using buckets_path with relative paths (just the metric name without parent prefix). Examples: derivative (calculates rate of change between consecutive buckets - month-over-month growth), cumulative_sum (running total across all buckets), moving_avg (7-day moving average), moving_fn (custom moving window), serial_diff (period-over-period comparison). Use case: Time-series analysis like calculating daily sales velocity or trend smoothing. Parent aggregation structure: {"date_histogram": {"field": "@timestamp", "calendar_interval": "month"}, "aggs": {"total_sales": {"sum": {"field": "amount"}}, "sales_derivative": {"derivative": {"buckets_path": "total_sales"}}}}. Each bucket receives the derivative value. Sibling pipeline aggregations sit NEXT TO (at the same hierarchy level as) their referenced aggregation, producing independent output summary metrics from sibling bucket results. They use buckets_path with full absolute paths like "parent_agg_name>metric_name" or nested paths like "agg1>agg2>metric". Examples: avg_bucket (average of all bucket metric values), min_bucket (identifies lowest bucket), max_bucket (identifies highest), sum_bucket (total across buckets), stats_bucket (comprehensive statistics including min/max/avg), percentiles_bucket. Use case: Finding highest monthly sales value or calculating average across all monthly totals. Sibling aggregation structure: {"aggs": {"sales_per_month": {"date_histogram": {"field": "@timestamp", "calendar_interval": "month"}, "aggs": {"total_sales": {"sum": {"field": "amount"}}}}, "max_monthly_sales": {"max_bucket": {"buckets_path": "sales_per_month>total_sales"}}}}. max_monthly_sales sits at the top level as an independent aggregation.
Critical structural difference: Parent aggregations ENRICH existing buckets (add output columns), sibling aggregations SUMMARIZE buckets (create independent summary at top level). buckets_path syntax differs significantly: Parent uses simple relative path "metric_name" referencing sibling metrics within same parent. Sibling uses full absolute path "parent_agg>metric" traversing aggregation hierarchy. Special keywords for both: "_count" (document count in each bucket), "_key" (bucket key like timestamp). Gap policy parameter (skip/insert_zeros) handles missing bucket values gracefully. Critical constraint: Pipeline aggregations process ONLY aggregation outputs, never document fields.
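Both positions can be shown side by side in one request body. A sketch with illustrative field names ("@timestamp", "amount"), combining a parent derivative (relative buckets_path) and a sibling max_bucket (absolute buckets_path):

```python
# Parent pipeline nested INSIDE the date_histogram; sibling pipeline
# NEXT TO it at the top level. Field names are illustrative.
request = {
    "size": 0,
    "aggs": {
        "sales_per_month": {
            "date_histogram": {"field": "@timestamp",
                               "calendar_interval": "month"},
            "aggs": {
                "total_sales": {"sum": {"field": "amount"}},
                # parent: enriches each monthly bucket,
                # relative path (just the sibling metric name)
                "sales_growth": {
                    "derivative": {"buckets_path": "total_sales"}
                },
            },
        },
        # sibling: one independent summary value at the top level,
        # absolute path traversing the aggregation hierarchy
        "max_monthly_sales": {
            "max_bucket": {"buckets_path": "sales_per_month>total_sales"}
        },
    },
}
```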
Synonym filters expand search terms with equivalent terms. Two types: (1) synonym_graph (recommended for search analyzers) - correctly handles multi-word synonyms like "ipod, i-pod, i pod" using token graphs. (2) synonym (legacy) - simpler but breaks multi-word synonyms, deprecated for search-time use. Application timing: (1) Index-time synonyms: Applied during indexing, expand terms before storing in inverted index. Pros: faster queries. Cons: requires full reindex to update synonyms, increases index size. (2) Search-time synonyms: Applied during query analysis only. Pros: update synonyms without reindexing (set "updateable": true), smaller index. Cons: slightly slower queries. Modern approach (2025): Since Elasticsearch 8.13, use Synonyms Management APIs instead of synonym files. Create synonym set: PUT /_synonyms/my-synonyms {"synonyms_set": [{"synonyms": "laptop, notebook, computer"}]}. Reference in analyzer: "filter": [{"type": "synonym_graph", "synonyms_set": "my-synonyms", "updateable": true}]. Update without reindex: PUT /_synonyms/my-synonyms/laptop {"synonyms": "laptop, notebook, computer, chromebook"}. Synonym formats: (1) Equivalent: "ipod, i-pod, i pod" (all interchangeable). (2) Explicit mappings: "universe, cosmos => cosmos" (only expands left side to right). Best practice (2025): Use search-time synonym_graph filter with Synonyms API and "updateable": true for maximum flexibility. Reserve index-time for static taxonomies that never change.
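The Synonyms API call and the analyzer wiring above can be sketched as request payloads. The set name "my-synonyms", rule id, and analyzer name are illustrative; the endpoint and body shapes follow the 8.13+ Synonyms Management API described above:

```python
# PUT /_synonyms/my-synonyms -- create/replace an updateable synonym set.
create_set = {
    "endpoint": "PUT /_synonyms/my-synonyms",
    "body": {
        "synonyms_set": [
            {"id": "laptop", "synonyms": "laptop, notebook, computer"},
        ]
    },
}

# Index analysis settings referencing the set with synonym_graph.
# "updateable": True allows synonym updates without a reindex,
# restricting the filter to search-time analyzers.
search_analyzer = {
    "analysis": {
        "filter": {
            "my_synonyms": {
                "type": "synonym_graph",
                "synonyms_set": "my-synonyms",
                "updateable": True,
            }
        },
        "analyzer": {
            "search_with_synonyms": {
                "tokenizer": "standard",
                "filter": ["lowercase", "my_synonyms"],
            }
        },
    }
}
```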
Okapi BM25 (Best Matching 25) is probabilistic ranking function calculating document relevance scores using three factors: (1) Term frequency (TF) with saturation, (2) Inverse document frequency (IDF) - rare terms score higher, (3) Document length normalization - prevents long documents from unfairly dominating. Formula: score = IDF(q) * (TF(q,d) * (k1 + 1)) / (TF(q,d) + k1 * (1 - b + b * (|d| / avgdl))) where k1=1.2 (default, controls TF saturation) and b=0.75 (default, controls length normalization). Adoption timeline: Elasticsearch 5.0 (2016) and Apache Lucene 6.0 switched from TF-IDF to BM25 as default similarity algorithm. Reasons for switch: TF-IDF shortcomings include no document length consideration and unsaturated term frequency (keyword stuffing inflates scores). BM25 improvements: Better term frequency saturation (diminishing returns for repetition), superior document length normalization, better relevance in production tests. Current status (2025): BM25 remains default scoring algorithm in all modern Elasticsearch versions (8.x). Configuration: Customize per-field with {"similarity": {"type": "BM25", "k1": 1.5, "b": 0.8}} in index mapping. Best practice: Use default parameters (k1=1.2, b=0.75) unless specific ranking issues identified through A/B testing.
Default values: k1 = 1.2 (term frequency saturation parameter), b = 0.75 (document length normalization factor). These defaults from academic research work well for 90%+ of use cases. How they work: (1) k1 controls term frequency saturation curve. Higher k1 (e.g., 2.0) gives more weight to repeated terms - good for technical docs where repetition signals relevance. Lower k1 (e.g., 0.8) reduces impact of term repetition - good for marketing content with keyword stuffing. Range: 1.2-2.0 typical, rarely goes below 1.0. (2) b controls document length normalization. b=1.0 fully normalizes by length (penalizes long docs heavily). b=0.0 disables normalization (favors long comprehensive docs). b=0.75 balances both. Tuning guidance (2025): Start with defaults (k1=1.2, b=0.75). If short docs rank too low, decrease b to 0.5-0.6. If keyword-stuffed docs rank too high, increase k1 to 1.5-2.0. Configuration: Set per-field in mapping with "similarity": {"my_custom_bm25": {"type": "BM25", "k1": 1.5, "b": 0.8}}. Real-world impact: Adjusting k1 from 1.2 to 1.8 in technical documentation improved precision@10 by 12% in A/B tests. Most users should stick with defaults unless specific ranking issues identified through user testing.
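The saturation effect of k1 can be seen numerically. A sketch of the BM25 term-frequency component for a document of average length (so the length term reduces to tf + k1 in the denominator):

```python
def tf_weight(tf, k1=1.2, b=0.75, dl_ratio=1.0):
    """BM25 term-frequency component, with dl_ratio = |d| / avgdl.
    For an average-length doc (dl_ratio=1) this is tf*(k1+1)/(tf+k1)."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl_ratio))

# Low k1 saturates fast (repetition barely helps past a few occurrences);
# high k1 keeps rewarding repeated terms.
low = [round(tf_weight(tf, k1=0.8), 2) for tf in (1, 5, 10)]   # [1.0, 1.55, 1.67]
high = [round(tf_weight(tf, k1=2.0), 2) for tf in (1, 5, 10)]  # [1.0, 2.14, 2.5]
```

Note how at tf=10 the high-k1 weight is still climbing while the low-k1 weight has nearly plateaued, which is the tuning lever described above.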
Algolia uses Damerau-Levenshtein distance algorithm for fuzzy matching, calculating edit distance between query and indexed terms. Supported operations: insertion, deletion, substitution, transposition (swapping adjacent characters). Typo handling rules: (1) Words with 1-3 characters: no typos allowed (too short for reliable fuzzy matching). (2) Words with 4-7 characters: 1 typo allowed. (3) Words with 8+ characters: 2 typos allowed. (4) Exception: 3 typos allowed if first typo is on initial letter (accounts for common typing mistakes). Configuration parameters: minWordSizefor1Typo (default: 4), minWordSizefor2Typos (default: 8). Ranking impact: Typo count is PRIMARY ranking criterion before all other signals. Ranking order: exact match (0 typos) > 1 typo > 2 typos. Within same typo count, other ranking criteria apply (custom ranking, text relevance, geo distance). Advanced control: Set typoTolerance per query: "true" (default), "false" (strict matching), "min" (reduce typo allowance by 1), "strict" (disable prefix matching). Use disableTypoToleranceOnWords: ["brand", "iphone"] to require exact matches for specific terms. Performance: Typo tolerance adds minimal overhead (<2ms) due to optimized prefix trees. User impact: Improves conversion rates 15-30% for e-commerce search by handling "ipone" → "iphone", "lapto" → "laptop". Best practice (2025): Enable typo tolerance globally, disable for brand names and SKUs using disableTypoToleranceOnWords or disableTypoToleranceOnAttributes.
An Elasticsearch segment is a self-contained Apache Lucene mini-index - a complete, immutable snapshot of documents written at a specific point in time. Architecture: Each shard contains multiple segments. Each segment contains four core components: (1) inverted index (term → [docID, frequency, positions] mappings enabling O(1) lookups), (2) stored fields (original JSON documents), (3) doc values (column-oriented data for sorting/aggregations), (4) norms (field length metadata for BM25 scoring). Segments are immutable by design - once written to disk via fsync, they never change. Why immutable (write-once design)? Benefits are significant: (1) Concurrent lock-free reads - multiple queries scan identical segment simultaneously with zero contention, enabling 1000+ QPS per shard without synchronization overhead. (2) OS filesystem caching - immutable files persist in OS page cache indefinitely without cache invalidation. Result: 30-50% query speedup and faster aggregations via disk I/O reduction. (3) Aggressive compression - write-once guarantee enables delta-encoding, variable-byte encoding, and frame-of-reference compression (40-60% smaller than mutable structures). (4) Simplified data structures - no concurrent write handling complexity, no versioning overhead. Deletion handling: Documents never deleted in-place. Instead: (1) Updates/deletes mark document as "deleted" in lightweight .liv (live docs) bitmap file, (2) deleted documents excluded from search results but still occupy disk space, (3) actual deletion happens only during segment merging when deleted docs physically removed. Segment merging: Background TieredMergePolicy combines small segments into larger ones (10 segments per shard target), removing deleted documents, reclaiming disk space. Merge cost: I/O intensive but throttled (default 20MB/s) to avoid impacting query performance. 
Real-world impact: This write-once, immutable-by-design architecture is fundamental to Elasticsearch's ability to handle billions of time-series documents efficiently - new data writes append-only to new segments while billions of historical documents remain in immutable segments optimized for reading.
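The mark-then-merge deletion lifecycle can be modeled in a few lines. A toy sketch, not Lucene's actual data structures: the docs dict stands in for the immutable segment files and the live dict for the .liv bitmap:

```python
class Segment:
    """Toy immutable segment with a .liv-style live-docs bitmap:
    documents are never modified in place, deletes only flip a flag."""
    def __init__(self, docs):
        self.docs = dict(docs)                          # written once
        self.live = {doc_id: True for doc_id in docs}   # .liv bitmap

    def delete(self, doc_id):
        self.live[doc_id] = False   # mark deleted; bytes stay on disk

    def search(self, term):
        return [i for i, text in self.docs.items()
                if term in text.lower().split() and self.live[i]]

def merge(segments):
    """Merging copies only live docs into a new segment, which is when
    deleted documents are physically removed and space reclaimed."""
    merged = {}
    for seg in segments:
        merged.update({i: t for i, t in seg.docs.items() if seg.live[i]})
    return Segment(merged)

seg = Segment({1: "docker run", 2: "docker stop", 3: "compose up"})
seg.delete(2)              # excluded from search, still on "disk"
merged = merge([seg])      # doc 2 gone for good
```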
Replica shards improve query performance through intelligent distribution and load balancing. Replicas enable horizontal read scaling because both primary and replica shards handle search requests identically, while only primary shards accept writes. Throughput improvement: With N replica shards, cluster can process approximately N+1 times more concurrent queries for same data. Example: 1 primary + 2 replicas = 3x query throughput vs 0 replicas. Query routing mechanism: Elasticsearch uses Adaptive Replica Selection (ARS), enabled by default in version 7+, which routes each search request to the best-performing shard copy using EWMA (exponentially weighted moving average) metrics: (1) Service time EWMA (how long prior searches took on each node), (2) Response time EWMA (network latency from coordinating node), (3) Search queue depth EWMA (number of queued search requests). This avoids routing queries to degraded nodes experiencing GC pauses, disk I/O, or network saturation. Benchmark results: Under load, ARS improved throughput 113% (41→88 queries/sec) and reduced p90 latency 64.7% (5,215ms→1,839ms) and p99 latency 60.6% (6,181ms→2,434ms). Critical distinction: Replicas improve READ throughput only. Writes (index, update, delete) require replication to all replicas before confirming, so more replicas INCREASE write latency—replicas don't improve write performance. Single-query latency unchanged: Adding replicas doesn't reduce latency of individual queries. Multiple concurrent queries benefit when distributed across replicas. Geographic distribution benefit: Placing replicas in different availability zones reduces network latency to geographically distributed users. Trade-offs: Storage cost (2 replicas = 3x disk space), write overhead (longer replication latency), recovery time (longer cluster rebalancing). Configuration: PUT /my-index/_settings {"number_of_replicas": 2}. 
Best practice (2025): 1-2 replicas for production (1 for cost-sensitive, 2 for mission-critical). Requires adequate cluster nodes—if each shard copy isn't on separate node, benefits disappear.
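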
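The EWMA-based routing idea behind ARS can be sketched in miniature. The weighting below is illustrative only, not Elasticsearch's actual ranking formula; the point is that the coordinating node tracks smoothed per-copy metrics and routes to the cheapest-looking copy:

```python
def ewma(prev, sample, alpha=0.3):
    """Exponentially weighted moving average: recent samples dominate,
    so a node recovering from a GC pause is re-trusted gradually."""
    return alpha * sample + (1 - alpha) * prev

def pick_replica(nodes):
    """Toy adaptive replica selection: choose the shard copy with the
    lowest combined service time, response time and queue depth.
    The 10x queue weight is an assumption for illustration."""
    return min(nodes, key=lambda n: n["service_ewma"]
               + n["response_ewma"] + 10 * n["queue_ewma"])

nodes = [
    {"name": "node-a", "service_ewma": 5.0, "response_ewma": 2.0, "queue_ewma": 0.0},
    {"name": "node-b", "service_ewma": 3.0, "response_ewma": 2.0, "queue_ewma": 4.0},
]
best = pick_replica(nodes)  # node-b is faster but has a deep queue
```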
In September 2024, AWS transferred OpenSearch governance from AWS-controlled project to Linux Foundation, establishing the OpenSearch Foundation as independent steward. Significance: (1) Vendor neutrality - governance no longer controlled by single company (AWS). Decision-making through community consensus vs corporate directive. (2) Multi-vendor participation - enables IBM, SAP, Canonical, and other contributors to have equal standing alongside AWS. (3) Intellectual property protection - Linux Foundation provides neutral legal home for project IP. (4) Long-term sustainability - reduces risk of project abandonment if AWS priorities shift. (5) Enterprise trust - organizations wary of AWS lock-in now see OpenSearch as truly vendor-neutral like Linux, Kubernetes. Governance structure: Technical Steering Committee with representatives from multiple organizations, transparent RFC process for major changes, Apache 2.0 license unchanged (stays open source). Context: This move mirrors Elasticsearch's September 2024 AGPLv3 license addition - both projects addressing community trust concerns in 2024. Impact (2025): OpenSearch adoption increased 40%+ in Q4 2024 following announcement, particularly among enterprises running multi-cloud strategies. Critical for organizations evaluating OpenSearch vs Elasticsearch: OpenSearch now has truly neutral governance comparable to CNCF projects, while Elasticsearch remains Elastic-controlled (though with open-source AGPLv3 option).
Algolia's primary performance claims: (1) 1-20ms average query latency for most searches (targeting <50ms end-to-end for as-you-type search). (2) 12-200x faster than Elasticsearch depending on query complexity and Elasticsearch configuration. (3) <1ms response on simple single-character queries (e.g., "e") vs 202ms for single-shard Elasticsearch. (4) Consistent sub-5ms response times across typical queries (geo, batman, emilia) when benchmarked against unoptimized Elasticsearch. Architectural advantages: (1) Hardware: Bare metal with high-frequency 3.5-3.9 GHz Intel Xeon processors, indices entirely in RAM (256GB+ per shard), no Java GC pauses (uses C++ implementation). (2) Geographic: Distributed Search Network with 15+ regions providing 1-2ms latency reduction per 124 miles of distance, automatic replication across regions. (3) Pre-computation: Results pre-sorted at indexing time (not during query), eliminating sorting overhead. (4) Abstraction: Automatic sharding/rebalancing backend (not exposed in API), simplifying operations vs manual Elasticsearch shard management. Scale metrics (2025): 30+ billion records indexed, 1.7+ trillion searches annually. Reality check: "200x faster" claims use default Elasticsearch vs optimized Algolia on simple queries - misleading comparison. Well-tuned Elasticsearch with proper sharding, regional distribution, and caching achieves 20-50ms P95 latency, much closer to Algolia's claims. Trade-offs: Algolia wins on out-of-box performance and managed simplicity. Elasticsearch wins on flexibility (complex aggregations, joins, custom scoring), cost at 100M+ scale (self-hosted), and control. Critical context (2025): Algolia's speed claims are verifiable but optimized for e-commerce/catalog search use cases. Elasticsearch requires 3-6 months expertise investment to match similar latency but offers vastly more analytical power (APM, logs, SIEM). 
Choose based on use case: Algolia for search-first applications, Elasticsearch for analytics/observability or high-volume cost-sensitive deployments.
buckets_path specifies which aggregation outputs a pipeline aggregation uses as input by referencing metrics from sibling or parent aggregations. Formal syntax uses > as aggregation separator, . as metric separator, and [KEY_NAME] for multi-bucket key selection. Syntax: "buckets_path": "path>to>metric" navigates through nested aggregations to target specific metrics. Path formats: (1) Single metric: "sales>total" references total metric in sales aggregation. (2) Multi-value metrics: "stats_agg.avg" uses . separator to reference avg metric from multi-value stats aggregation. (3) Nested aggregations: "date_hist>category_terms>revenue.sum" navigates multiple aggregation levels using > separator. (4) Multi-bucket key selection: "sale_type['hat']>sales" selects specific 'hat' bucket key from multi-bucket sale_type aggregation - enables calculating metrics for individual bucket values. (5) Multiple paths (object notation): {"revenue": "sales>total", "costs": "expenses>total"} creates named variables accessible as params.revenue and params.costs in bucket_script. (6) Special keywords: "_count" (document count in bucket), "_key" (bucket key value, useful for time-series timestamp), "_value" (metric value from sibling aggregation). Comprehensive example with multi-bucket selection: {"bucket_script": {"buckets_path": {"hat_sales": "product_type['hat']>sales", "bag_sales": "product_type['bag']>sales", "shoe_sales": "product_type['shoe']>sales"}, "script": "params.hat_sales + params.bag_sales + params.shoe_sales"}}. Parent pipeline aggregations (derivative, moving_avg, cumulative_sum): use relative paths like {"buckets_path": "revenue"} referencing sibling metrics within same parent aggregation. Sibling pipeline aggregations (avg_bucket, max_bucket, sum_bucket): use full paths like {"buckets_path": "parent_agg>metric"} referencing separate sibling aggregations. Error handling: non-existent paths cause "No aggregation found" errors. 
Use gap_policy parameter to handle missing bucket values: "skip" (default, omit bucket), "insert_zeros" (treat as 0). Critical constraint: buckets_path references ONLY aggregation outputs, never document fields directly.
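The path grammar above (> between aggregations, . before a multi-value metric property, ['key'] for bucket selection) can be captured by a small parser. A hypothetical helper for illustration, not part of any Elasticsearch client:

```python
import re

def parse_buckets_path(path):
    """Split a buckets_path string into ((agg_name, bucket_key), ...)
    steps plus an optional multi-value metric property after '.'.
    Covers the formats above; ignores malformed input for brevity."""
    if "." in path:
        path, prop = path.rsplit(".", 1)   # e.g. "stats_agg.avg"
    else:
        prop = None
    steps = []
    for part in path.split(">"):
        m = re.match(r"^([^\[]+)(?:\['([^']+)'\])?$", part)
        steps.append((m.group(1), m.group(2)))  # key is None if absent
    return steps, prop
```

For example, "sale_type['hat']>sales" parses to the steps (sale_type selecting the 'hat' bucket, then sales), and "stats_agg.avg" to a single step with metric property "avg".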
When forking Elasticsearch 7.10.2 (last Apache 2.0 version) in April 2021, Amazon removed all code incompatible with Apache 2.0 license to create OpenSearch. Removed components: (1) Entire X-Pack codebase - Elastic's commercial/proprietary features including: security (SSO, RBAC, field-level security), machine learning (anomaly detection, data frame analytics), monitoring (cluster health dashboards), alerting (commercial version), SQL query interface (commercial features), graph analytics, reporting (PDF/PNG generation). (2) Elastic-branded telemetry - phone-home metrics collection sending usage data to Elastic servers. (3) Elastic trademarks - all logos, branding, references to "Elastic" company. What Amazon replaced: (1) Security: OpenSearch Security plugin (based on Open Distro for Elasticsearch's security module, originally developed by Amazon). (2) Alerting: OpenSearch Alerting (open-source alternative). (3) Machine Learning: OpenSearch ML Commons (different architecture, k-NN focus). (4) Dashboards: OpenSearch Dashboards (fork of Kibana 7.10.2, similarly de-X-Packed). (5) SQL: OpenSearch SQL (open-source query interface). License cleanup impact: OpenSearch codebase is 100% Apache 2.0 with no proprietary elements. Clean IP allows community contributions without license concerns. Feature parity (2025): OpenSearch has rebuilt most X-Pack functionality under Apache 2.0, though some advanced ML features still lag behind Elastic's commercial offerings. Critical context: This removal was necessary because Elastic's 2021 license change (Apache 2.0 → SSPL/ELv2) made X-Pack code unusable in Apache 2.0 fork.
Match query executes slower than term query because it applies text analysis to the query string before searching the inverted index, while term query performs direct literal lookups without any preprocessing. Analysis phase adds measurable overhead: (1) Tokenization - breaking query into individual tokens based on tokenizer rules (standard tokenizer splits on whitespace and punctuation), (2) Lowercasing - normalizing case ("GET" → "get", "Running" → "running"), (3) Token filtering - removing stopwords ("the", "is", "and"), applying stemming rules ("running" → "run", "foxes" → "fox"), expanding synonyms ("laptop" → ["laptop", "notebook", "computer"]). This multi-step analysis adds 1-5ms latency per query. Term query skips all analysis: uses query string exactly as provided, performs single direct index lookup (~<1ms), returns binary match without relevance scoring. Performance measurements: Benchmark tests show term query ~2-3ms average latency while match query averages 3-5ms for typical single-field queries. Gap widens with complex analyzers (custom tokenizers, multiple filters, synonym expansion can add 2-10ms). However, absolute differences matter less than throughput: single-query perspective is milliseconds difference; high-volume perspective (10,000 queries/sec) multiplies by query volume. When to use each: Match query for full-text search on text fields (product descriptions, article content, user input where analysis improves matching). Match analyzes user input matching it against similarly-analyzed indexed terms - "Running shoes" query analyzes to ["run", "shoe"] tokens matching indexed documents. Term query for exact matching on keyword fields (UUIDs, status values like "ACTIVE", category IDs, email addresses). Term searches unanalyzed keyword fields where analysis would break matching (query "[email protected]" must match exactly, not tokenized). 
Critical mistake: Using term query on text fields causes mismatches because text fields store analyzed tokens while term query searches for literal strings. Query for "Running shoes" on text field searches for exact phrase token, finds nothing because indexed text contains ["run", "shoe"] tokens separately. Best practice (2025): Always match query to field type - text fields require match (or multi_match, match_phrase), keyword fields use term (or terms). For performance-critical exact matching, map as keyword type enabling efficient term queries. Use filter context instead of query context on term queries to leverage segment-level caching (2-10x faster than query context, skips scoring).
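The filter-context recommendation above can be shown as two equivalent lookups. A sketch with an illustrative "status" keyword field:

```python
# Same term lookup in two execution contexts. The filter version skips
# relevance scoring and its results are cacheable per segment.
query_context = {"query": {"term": {"status": "ACTIVE"}}}

filter_context = {
    "query": {
        "bool": {
            "filter": [                      # no scoring, cacheable
                {"term": {"status": "ACTIVE"}}
            ]
        }
    }
}
```

Both return the same documents; the filter-context form is the one to prefer when the yes/no match matters but the score does not.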
Default values: minWordSizefor1Typo = 4 characters, minWordSizefor2Typos = 8 characters. These control typo tolerance thresholds per word length. How they work: (1) Words with 1-3 characters: No typos allowed (exact match required). Rationale: Short words like "cat", "dog", "car" have too few characters for reliable fuzzy matching - 1 typo would change the word entirely. (2) Words with 4-7 characters: 1 typo allowed (uses the minWordSizefor1Typo = 4 threshold). Example: "ipone" matches "iphone" (1 character insertion). (3) Words with 8+ characters: 2 typos allowed (uses the minWordSizefor2Typos = 8 threshold). Example: "elastcserch" matches "elasticsearch" (2 typos: deletion of "i" and "a"). Customization use cases: Strict matching for brands/SKUs: Increase minWordSizefor1Typo to 6 or disable typos entirely for specific attributes using disableTypoToleranceOnAttributes. More forgiving for international names: Decrease minWordSizefor1Typo to 3 if users frequently misspell short foreign words. Technical products: Decrease minWordSizefor2Typos to 6 if product names like "router" (6 chars) frequently see 2-typo queries like "rutter" (2 substitutions). Configuration: Set in index settings: {"minWordSizefor1Typo": 5, "minWordSizefor2Typos": 10} or per-query: searchParameters: {"typoTolerance": "min", "minWordSizefor1Typo": 5}. Real-world impact: Default values (4, 8) are optimized for e-commerce and content search based on Algolia's analysis of billions of queries. Most use cases should use the defaults. Only adjust if seeing too many false positives (raise the threshold) or too many missed matches (lower the threshold). Best practice (2025): Start with defaults, adjust only after analyzing search logs showing specific typo patterns in your domain.
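The length thresholds reduce to a small lookup. A minimal sketch (the function name is hypothetical; defaults mirror the values above):

```python
def allowed_typos(word, min_word_size_for_1_typo=4, min_word_size_for_2_typos=8):
    """Typo allowance per word length, mirroring the two thresholds:
    0 typos below the first threshold, 1 typo between them, 2 at or above
    the second."""
    n = len(word)
    if n >= min_word_size_for_2_typos:
        return 2
    if n >= min_word_size_for_1_typo:
        return 1
    return 0
```

With the defaults, "cat" (3 chars) must match exactly, "ipone" (5 chars) tolerates one typo, and "elasticsearch" (13 chars) tolerates two.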
BM25 score formula: score = IDF(q) * (TF(q,d) * (k1 + 1)) / (TF(q,d) + k1 * (1 - b + b * (|d| / avgdl))). Components breakdown: (1) IDF(q) - Inverse Document Frequency: Measures term rarity across the corpus. Formula: log(1 + (N - df + 0.5) / (df + 0.5)) where N = total docs, df = docs containing the term. Rare terms get higher IDF scores (more valuable for relevance). (2) TF(q,d) - Term Frequency: How many times query term q appears in document d. Raw count, not normalized. More occurrences = higher relevance (with saturation). (3) k1 parameter (default: 1.2): Controls term frequency saturation. Higher k1 = less saturation, repeated terms matter more. Lower k1 = faster saturation, diminishing returns for repetition. (4) b parameter (default: 0.75): Controls document length normalization. b=1 fully normalizes (penalizes long docs), b=0 disables normalization (favors long docs). (5) |d| - Document length (word count in doc). (6) avgdl - Average document length across the corpus. (7) boost - Optional query-level multiplier (default: 1.0). Simplified intuitive formula: score ≈ boost * IDF * (TF with saturation and length normalization). Example calculation: Query "docker" in a document containing "docker" 3 times, 100 words long, avgdl=150, k1=1.2, b=0.75, IDF=2.5. TF component = (3 * 2.2) / (3 + 1.2 * (1 - 0.75 + 0.75 * (100/150))) = 6.6 / 3.9 ≈ 1.69. Final score ≈ 2.5 * 1.69 ≈ 4.2. Why BM25 over TF-IDF: Better handling of term frequency saturation (prevents keyword stuffing from inflating scores), improved document length normalization. Elasticsearch adoption: Default since version 5.0 (2016), replacing classic TF-IDF similarity.
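The worked example reduces to a few lines of arithmetic. This sketch implements only the per-term formula above; real Elasticsearch scoring additionally folds in per-segment statistics and query structure:

```python
import math

def bm25_term_score(tf, idf, dl, avgdl, k1=1.2, b=0.75, boost=1.0):
    """Per-term BM25: boost * IDF * saturated, length-normalized TF."""
    length_norm = 1 - b + b * (dl / avgdl)
    return boost * idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

def compute_idf(n_docs, doc_freq):
    """Lucene-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# The "docker" example: tf=3, dl=100 words, avgdl=150, IDF=2.5
score = bm25_term_score(tf=3, idf=2.5, dl=100, avgdl=150)  # ≈ 4.23
```

Note how compute_idf grows as doc_freq shrinks: rare terms dominate the score.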
Recommendation (2025): Use search-time synonyms for maximum flexibility and maintainability - advantages significantly outweigh minimal performance cost. Search-time is the clear best practice, with index-time justified only in rare scenarios. SEARCH-TIME SYNONYMS (RECOMMENDED): Advantages: (1) Update without reindexing - Set "updateable": true and modify synonyms via Synonyms Management API (Elasticsearch 8.10+) without reindexing any data (hours of reindex work becomes <1 second). (2) Smaller index size - Synonyms stored separately from inverted index, not duplicated across documents, reducing disk footprint 10-30% on synonym-heavy domains. (3) Fast iteration - Test changes instantly, A/B test variants, respond rapidly to trending terms (new product names, brand variations, emerging slang). (4) Centralized management - Single synonym set referenced by multiple analyzers ensures consistent behavior; changes propagate automatically. (5) Enables synonym_graph filter - Correctly handles multi-word synonyms ("credit card" ↔ "cc") using token graphs; legacy synonym filter breaks phrase matching. Disadvantages: ~1-3ms additional query latency per search for real-time synonym expansion (negligible for most applications). INDEX-TIME SYNONYMS: Advantages: (1) Marginally faster queries - Synonyms pre-expanded during indexing provides ~2% improvement (only measurable at sub-10ms latency budgets). (2) Works reliably with match_phrase queries - Phrase position matching unaffected. Disadvantages: (1) Requires full reindex to update synonyms - Every synonym change forces reindexing all documents (hours-days for large indexes, operational complexity). (2) Larger index size - All synonym expansions stored in inverted index, inflating disk 20-50%. (3) Statistics skewing - Expanded terms have artificially high frequency, degrading BM25 relevance scoring. (4) Difficult A/B testing - Can't test synonym changes without expensive full reindex. 
MODERN BEST PRACTICE (Elasticsearch 8.10+): Use Synonyms Management API with search-time synonym_graph. Create: PUT /_synonyms/product-synonyms {"synonyms_set": [{"synonyms": "laptop, notebook, computer, chromebook"}, {"synonyms": "quick, fast, rapid, speedy"}]}. Reference in search analyzer: {"filter": [{"type": "synonym_graph", "synonyms_set": "product-synonyms", "updateable": true}]}. Update on-the-fly: PUT /_synonyms/product-synonyms/laptop {"synonyms": "laptop, notebook, computer, chromebook"} then POST /my-index/_reload_search_analyzers (zero downtime, applies immediately). HYBRID APPROACH (RARE): Index-time only for ultra-stable domain-specific taxonomies (legal terms, regulatory categories) that change less than once per year AND have a latency budget under 5ms; otherwise use search-time. Critical: Use the synonym_graph filter (not the deprecated synonym filter) - only synonym_graph correctly handles multi-word synonyms without breaking phrase queries. For 99%+ of production use cases, search-time with the Synonyms API is optimal.
Elasticsearch segment merging is an automatic background process that combines multiple small immutable Lucene segments into fewer larger segments within a shard, balancing search performance with indexing overhead. Purpose: (1) Reclaim disk space by physically expunging soft-deleted documents (marked in .liv bitmap files, not actually deleted), (2) Improve query performance by reducing segment count (searching 10 segments is faster than 100), (3) Lower memory pressure from excessive per-segment state overhead. How it works: Elasticsearch monitors each shard's segment sizes and count. TieredMergePolicy (the default policy) calculates a logarithmic staircase budget based on index size and parameters. When the actual segment count exceeds the budget, the merge scheduler selects candidates: it reads source segments, combines their inverted indexes, skips deleted documents, and writes a single merged segment to disk. Once the merged segment is fsync'd, an atomic switch replaces the old segments (deleted after all in-flight queries finish). Key parameters (configurable): index.merge.policy.max_merged_segment (default 5GB - segments larger than this are excluded from merging), index.merge.policy.segments_per_tier (default 10 - target segment count per tier), index.merge.policy.floor_segment (default 2MB - minimum segment size), index.merge.policy.max_merge_at_once (default 10 - max segments merged in a single operation), index.merge.policy.expunge_deletes_allowed (default 10% - deleted-docs threshold to trigger delete expunging). Throttling mechanism: Auto-throttling limits merge I/O to prevent query performance degradation, starting around 20MB/s and adjusting automatically based on merge pressure. When merging cannot keep up with indexing, indexing operations are throttled automatically. Control threads: index.merge.scheduler.max_thread_count (default max(1, min(4, cores/2))) - higher allows more parallel merges. Performance impact: Merging is write-amplification intensive (ratio of bytes processed vs final index size).
Typical merge storms occur with heavy indexing (10M+ docs/hour), creating many small segments and causing sustained high I/O and CPU. Tradeoff: A higher refresh_interval (default 1s) reduces segment creation and merge frequency but delays when newly indexed documents become searchable. Force merge: POST /my-index/_forcemerge?max_num_segments=1 produces a single segment, optimizing searches but hindering future merges (creates 5GB+ segments that won't be merged automatically). Critical warning: Force merging actively written indexes causes problems - the oversized segments cannot participate in automatic merges, so deletions accumulate and performance degrades. Production best practice (2025): Let the automatic TieredMergePolicy handle routine merging. Use forcemerge only on read-only indices (completed time-series data) for snapshot optimization. Monitor merge activity: GET /my-index/_stats/merge or GET /_cat/nodes?v&h=name,merges.current,merges.total. For write-heavy workloads, consider staggered shard allocation across data tiers.
This statement is incorrect - Elasticsearch has the significantly steeper learning curve, not Algolia. Algolia advantages: User-friendly interface, intuitive APIs, comprehensive SDKs for major languages, polished developer experience, reduced maintenance needs, quicker implementation (minutes vs days). Setup is straightforward with minimal configuration. Learning curve only steepens when deep customization needed (advanced ranking algorithms, tie-breaking system). Elasticsearch challenges (2025): Notoriously steep learning curve requiring extensive technical expertise. Setup complexity: Must configure Java runtime, network ports, security settings, node roles, shard allocation, index mappings, replica strategies. Production-ready cluster requires understanding cluster health monitoring, split-brain prevention, master node election. Ranking complexity: Getting results properly ranked requires understanding BM25 parameters, custom similarity functions, function_score queries, boosting strategies. DevOps overhead: Self-hosted Elasticsearch requires ongoing management - capacity planning, backup strategies, version upgrades, security patching, performance tuning. Concepts to master: Inverted indexes, segment merging, refresh intervals, translog, circuit breakers, fielddata cache. Time investment: Elasticsearch expertise typically requires 3-6 months learning curve vs Algolia's days/weeks. Best practice (2025): Choose Algolia for rapid deployment with limited DevOps resources. Choose Elasticsearch when team has dedicated DevOps capacity and needs deep analytical capabilities beyond search.
Parent pipeline aggregation that executes custom scripts to perform per-bucket calculations on metrics from parent multi-bucket aggregations. Enables complex math operations combining multiple aggregation outputs within each bucket. Requirements: Input metrics must be numeric, script must return numeric value. Use cases: Calculate ratios (conversion rate = conversions / visits), profit margins (revenue - cost), percentages, ROI, growth rates. Configuration syntax: Uses buckets_path parameter where key is variable name for script, value is path to metric. Format: "aggregation_name>metric_name" for single path or object notation for multiple paths. Example calculating profit per month: {"bucket_script": {"buckets_path": {"revenue": "sales>total_revenue", "cost": "expenses>total_cost"}, "script": "params.revenue - params.cost"}}. Real-world example from Elasticsearch docs: Calculate t-shirt sales percentage of total monthly sales using date_histogram with filtered sub-aggregations, then bucket_script: {"buckets_path": {"tShirts": "t-shirts>sales", "total": "total>sales"}, "script": "params.tShirts / params.total * 100"}. Path navigation: Use '>' as aggregation separator, '.' as metric separator. Example: "my_bucket>my_stats.avg" references avg value in my_stats metric within my_bucket aggregation. Special keywords: "_count" (bucket document count), "_key" (bucket key value). Best practice (2025): Use bucket_script for business metrics calculations. Combine with date_histogram for time-series analytics, terms for category-based calculations.
Bucket aggregations group documents into buckets based on field values, ranges, or intervals for analytics and faceting. Common use cases: (1) Faceted search / filters: E-commerce: Group products by category, brand, price ranges. Example: terms aggregation on "category" field shows "Electronics (1,234)", "Clothing (567)". Users click facets to filter results. (2) Time-series analysis: Date histogram groups events into time buckets (hourly, daily, monthly). Example: Analyze log volume over time, sales trends per day, website traffic patterns. Use case: "Show me error count per hour for last 24 hours". (3) Category analytics: Terms aggregation finds top N values for a field. Example: "Top 10 bestselling products", "Most active users", "Popular search terms". Supports ordering by count or nested metric (top products by revenue). (4) Numeric distributions: Histogram groups numeric values into ranges. Example: Age distribution (0-10, 10-20, 20-30 years), price distribution ($0-$50, $50-$100). Use case: "Show me user age demographics". (5) Custom ranges: Range aggregation creates arbitrary buckets. Example: Price tiers (budget: $0-$100, mid: $100-$500, premium: $500+), performance tiers (slow: 0-100ms, medium: 100-500ms, fast: 500ms+). (6) Geo analytics: Geo hash grid groups documents by geographic location. Example: "Heatmap of user locations", "Sales by region". Common bucket types: terms (field values), date_histogram (time intervals), histogram (numeric intervals), range (custom ranges), filters (multiple filter criteria), nested (nested documents), geo_distance (geographic radius), significant_terms (unusual terms). Nesting: Bucket aggregations can be nested: terms > date_histogram > avg to show "average sales per day for each product category". Best practice (2025): Limit terms aggregation cardinality (use size parameter) to prevent memory issues. Use composite aggregation for pagination through large bucket sets. 
Combine with metric aggregations (avg, sum, max) for deeper analytics.
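The nesting pattern above (terms > date_histogram > avg, "average sales per day for each product category") can be sketched as a request body; the field names ("category.keyword", "sold_at", "amount") are illustrative assumptions. Expressed as a Python dict for readability:

```python
# Aggregation-only request: terms buckets, each holding daily date_histogram
# buckets, each holding an avg metric.
nested_aggs = {
    "size": 0,  # skip hits, return only aggregation results
    "aggs": {
        "by_category": {
            "terms": {"field": "category.keyword", "size": 10},  # cap cardinality
            "aggs": {
                "per_day": {
                    "date_histogram": {"field": "sold_at", "calendar_interval": "day"},
                    "aggs": {
                        "avg_amount": {"avg": {"field": "amount"}}
                    }
                }
            }
        }
    }
}
```

The explicit "size": 10 on the terms aggregation follows the best practice above of bounding bucket cardinality.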
Match query analyzes input text using field's analyzer (tokenization, lowercasing, stemming, synonyms) before searching inverted index. Example: "Running Shoes" becomes "run shoe" tokens. Use for full-text search on text fields. Term query performs exact match with zero analysis, searching literal string in inverted index. Example: "ACTIVE" only matches exact "ACTIVE" (case-sensitive). Use for keyword fields (IDs, statuses, tags). Performance: Term query 20-40% faster due to skipping analysis overhead (term ~2ms vs match ~3ms). Critical mistake: Using term on text fields fails because text fields are analyzed during indexing. Query "Running shoes" won't match indexed "run shoe" tokens. Best practice (2025): Use match for text field types, term for keyword field types. For performance-critical exact matching, use keyword fields with term queries. Configure field mapping: {"status": {"type": "keyword"}} for exact matching, {"description": {"type": "text"}} for full-text search.
Architecture hierarchy: Elasticsearch index contains multiple shards. Each shard is a complete Apache Lucene index. Each Lucene index contains multiple immutable segments (mini-indexes). Each segment contains inverted index structure mapping terms to document IDs. Inverted index: Processes documents to extract unique terms/tokens, records which documents contain each term. Example: Term "docker" → [doc1, doc5, doc12]. Search workflow: Query searches each segment sequentially within shard, combines results. Segments structure: Contains inverted index (term→document mappings), stored fields (original JSON), doc values (columnar data for sorting/aggregations), norms (field length for BM25 scoring). Immutability benefits: Concurrent reads without locks, aggressive OS filesystem caching (30-50% speedup), better compression (40-60%). Segment lifecycle: New documents create new segments, updates/deletes marked in .liv files (not modified in-place), background merging combines small segments into larger ones while removing deleted docs. 2025 enhancements: Elasticsearch 8.x integrates latest Lucene advancements including enhanced I/O parallelism and specialized HNSW graph merging for vector search.
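The term → document IDs mapping at the heart of a segment can be sketched in a few lines; this toy version uses lowercase + whitespace split as its "analyzer", whereas real analyzers add stemming, stopwords, and more:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # minimal tokenize + lowercase
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Docker runs containers", 5: "Deploy Docker images", 12: "Docker networking"}
index = build_inverted_index(docs)
# index["docker"] -> [1, 5, 12], matching the example mapping above
```

A query for "docker" then becomes a single dictionary lookup rather than a scan over documents, which is what makes inverted indexes fast.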
License change trigger: On January 21, 2021, Elastic NV changed Elasticsearch/Kibana licensing from permissive Apache License 2.0 to dual licensing under SSPL (Server Side Public License) and Elastic License v2. Neither SSPL nor ELv2 are OSI-approved open-source licenses. Elastic's goal: Prevent cloud providers (primarily AWS) from offering "Elasticsearch as a service" without contributing back to Elastic commercially. AWS response: April 2021 - Forked last Apache 2.0 version (Elasticsearch 7.10.2, Kibana 7.10.2) to create OpenSearch and OpenSearch Dashboards. Removed all Elastic proprietary code (X-Pack commercial features including security, ML, monitoring). First GA release: OpenSearch 1.0 in July 2021, just 3 months after fork announcement. License preservation: OpenSearch remains 100% Apache 2.0 with no proprietary elements, ensuring vendor neutrality. 2024-2025 developments: September 2024 - AWS transferred OpenSearch governance to Linux Foundation, establishing OpenSearch Foundation for vendor-neutral stewardship. Elastic added AGPLv3 license option in September 2024, making Elasticsearch officially open-source again alongside SSPL/ELv2 options. Current status (2025): Both projects are open-source - OpenSearch under Apache 2.0, Elasticsearch under AGPLv3/SSPL/ELv2 tri-licensing.
In September 2024, Elastic added Open Source Initiative (OSI) approved AGPLv3 license as option alongside SSPL and Elastic License v2 (ELv2). This made Elasticsearch officially open source again after 2021's controversial move to SSPL (source-available, not open source). Background: In 2021, Elastic changed from Apache 2.0 to dual licensing under SSPL 1.0 and ELv2 with 7.11 release. This 2021 change caused AWS to fork Elasticsearch 7.10.2 to create OpenSearch. With AGPLv3 addition in 2024, source code now available under three licenses: SSPL 1.0, AGPLv3, and ELv2 - giving users choice. Significance (2025): AGPLv3 enables customers and community to use, modify, redistribute, and collaborate on Elastic's source code under well-known open-source license. Addition doesn't affect existing SSPL or ELv2 users, no change to binary distributions. This licensing flexibility allows Elastic to regain "open source" classification while offering proprietary options. Critical difference: AGPLv3 is OSI-approved open source, SSPL is not (considered source-available). AGPLv3 copyleft provision requires source code disclosure if modified software used as network service. Impact: Addressing community trust concerns from 2021 license change.
Use Algolia for: (1) E-commerce and content discovery where search is core feature requiring instant results (1-20ms query latency at scale). (2) Rapid deployment needs - fully managed SaaS eliminates infrastructure overhead, setup takes minutes vs days. (3) Typo tolerance and relevance tuning out-of-box (Damerau-Levenshtein algorithm). (4) Limited technical resources - minimal DevOps requirements. Pricing: Record-based (1M records/month ~$1,500+), predictable scaling. Use Elasticsearch for: (1) Complex analytics beyond search - log analysis, APM, SIEM (Elastic Stack ecosystem). (2) Deep customization - modify scoring algorithms, build custom analyzers, plugins. (3) Cost-sensitive large-scale deployments - self-hosted Elasticsearch significantly cheaper at 100M+ documents (compute-based pricing vs record-based). (4) Strong technical teams comfortable managing infrastructure. (5) Full-text search with aggregations, filtering across high-cardinality fields. Performance: Both achieve <50ms with proper tuning. Algolia's edge: managed infrastructure, global CDN. Elasticsearch's edge: on-premises control, unlimited customization. Critical difference (2025): Algolia = speed + convenience premium, Elasticsearch = flexibility + cost efficiency at scale. Decision point: If search budget >$5K/month and team has DevOps capacity, evaluate Elasticsearch self-hosted. If search is mission-critical but team is small, Algolia's managed service reduces operational risk.
1. Bucket aggregations: Group documents into buckets based on field values, ranges, or criteria. Examples: terms (top N field values like product categories), histogram (numeric ranges), date_histogram (time intervals), range (custom buckets), filters (multiple filter criteria). Use cases: Faceted search filters, time-series analysis, category analytics. Example: {"terms": {"field": "category.keyword", "size": 10}} groups products by category. 2. Metric aggregations: Calculate statistics from field values - mathematical operations like COUNT, SUM, MIN, MAX, AVERAGE, CARDINALITY. Can be top-level or sub-aggregations within buckets. Examples: avg, sum, min, max, stats, percentiles, cardinality. Use case: Calculate average price per category. Example: {"avg": {"field": "price"}} calculates average price. 3. Pipeline aggregations: Take input from OTHER aggregations (not documents/fields), enabling chaining and transformations. Two families: Parent (add data to existing buckets - derivative, cumulative_sum, moving_avg) and Sibling (create new metric from sibling buckets - min_bucket, max_bucket, avg_bucket). Use case: Calculate month-over-month growth rate using derivative on date_histogram. Example: {"derivative": {"buckets_path": "sales>total"}} calculates change between consecutive buckets. Best practice (2025): Combine all three types - bucket to group, metric to calculate per bucket, pipeline to analyze trends across buckets.
Parent and sibling pipeline aggregations differ fundamentally in structural positioning and how they reference other aggregations. Parent pipeline aggregations are NESTED INSIDE their parent multi-bucket aggregation, adding computed metrics directly to each bucket's output. They operate on parent aggregation metrics using buckets_path with relative paths (just the metric name without a parent prefix). Examples: derivative (calculates rate of change between consecutive buckets - month-over-month growth), cumulative_sum (running total across all buckets), moving_avg (7-day moving average), moving_fn (custom moving window), serial_diff (period-over-period comparison). Use case: Time-series analysis like calculating daily sales velocity or trend smoothing. Parent aggregation structure: {"date_histogram": {"field": "@timestamp", "calendar_interval": "month"}, "aggs": {"total_sales": {"sum": {"field": "amount"}}, "sales_derivative": {"derivative": {"buckets_path": "total_sales"}}}}. Each bucket receives the derivative value. Sibling pipeline aggregations sit NEXT TO (at the same hierarchy level as) their referenced aggregation, producing an independent summary metric from the sibling's bucket results. They use buckets_path with full absolute paths like "parent_agg_name>metric_name" or nested paths like "agg1>agg2>metric". Examples: avg_bucket (average of all bucket metric values), min_bucket (identifies lowest bucket), max_bucket (identifies highest), sum_bucket (total across buckets), stats_bucket (comprehensive statistics including min/max/avg), percentiles_bucket. Use case: Finding the highest monthly sales value or calculating the average across all monthly totals. Sibling aggregation structure: {"aggs": {"sales_per_month": {"date_histogram": {"field": "@timestamp", "calendar_interval": "month"}, "aggs": {"total_sales": {"sum": {...}}}}, "max_monthly_sales": {"max_bucket": {"buckets_path": "sales_per_month>total_sales"}}}}. max_monthly_sales sits at the top level as an independent aggregation.
Critical structural difference: Parent aggregations ENRICH existing buckets (add output columns), sibling aggregations SUMMARIZE buckets (create independent summary at top level). buckets_path syntax differs significantly: Parent uses simple relative path "metric_name" referencing sibling metrics within same parent. Sibling uses full absolute path "parent_agg>metric" traversing aggregation hierarchy. Special keywords for both: "_count" (document count in each bucket), "_key" (bucket key like timestamp). Gap policy parameter (skip/insert_zeros) handles missing bucket values gracefully. Critical constraint: Pipeline aggregations process ONLY aggregation outputs, never document fields.
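The enrich-vs-summarize distinction is easy to see in plain Python; this sketch mimics what derivative (parent) and max_bucket (sibling) compute over a series of monthly bucket values (the numbers are made up for illustration):

```python
# Simulated total_sales values from four date_histogram (monthly) buckets.
monthly_sales = [100.0, 150.0, 130.0, 200.0]

# Parent-style (derivative): enriches each bucket with the change from the
# previous bucket's metric; the first bucket has no derivative.
derivative = [None] + [b - a for a, b in zip(monthly_sales, monthly_sales[1:])]

# Sibling-style (max_bucket): a single independent summary value computed
# across all sibling buckets, emitted at the top level of the response.
max_monthly = max(monthly_sales)
```

derivative has one value per bucket (a new "column" in each bucket), while max_monthly is one number standing apart from the buckets, exactly mirroring the structural difference described above.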
Synonym filters expand search terms with equivalent terms. Two types: (1) synonym_graph (recommended for search analyzers) - correctly handles multi-word synonyms like "ipod, i-pod, i pod" using token graphs. (2) synonym (legacy) - simpler but breaks multi-word synonyms, deprecated for search-time use. Application timing: (1) Index-time synonyms: Applied during indexing, expand terms before storing in inverted index. Pros: faster queries. Cons: requires full reindex to update synonyms, increases index size. (2) Search-time synonyms: Applied during query analysis only. Pros: update synonyms without reindexing (set "updateable": true), smaller index. Cons: slightly slower queries. Modern approach (2025): Since Elasticsearch 8.10, use Synonyms Management APIs instead of synonym files. Create synonym set: PUT /_synonyms/my-synonyms {"synonyms_set": [{"synonyms": "laptop, notebook, computer"}]}. Reference in analyzer: "filter": [{"type": "synonym_graph", "synonyms_set": "my-synonyms", "updateable": true}]. Update without reindex: PUT /_synonyms/my-synonyms/laptop {"synonyms": "laptop, notebook, computer, chromebook"}. Synonym formats: (1) Equivalent: "ipod, i-pod, i pod" (all interchangeable). (2) Explicit mappings: "universe, cosmos => cosmos" (only expands left side to right). Best practice (2025): Use search-time synonym_graph filter with Synonyms API and "updateable": true for maximum flexibility. Reserve index-time for static taxonomies that never change.
Okapi BM25 (Best Matching 25) is probabilistic ranking function calculating document relevance scores using three factors: (1) Term frequency (TF) with saturation, (2) Inverse document frequency (IDF) - rare terms score higher, (3) Document length normalization - prevents long documents from unfairly dominating. Formula: score = IDF(q) * (TF(q,d) * (k1 + 1)) / (TF(q,d) + k1 * (1 - b + b * (|d| / avgdl))) where k1=1.2 (default, controls TF saturation) and b=0.75 (default, controls length normalization). Adoption timeline: Elasticsearch 5.0 (2016) and Apache Lucene 6.0 switched from TF-IDF to BM25 as default similarity algorithm. Reasons for switch: TF-IDF shortcomings include no document length consideration and unsaturated term frequency (keyword stuffing inflates scores). BM25 improvements: Better term frequency saturation (diminishing returns for repetition), superior document length normalization, better relevance in production tests. Current status (2025): BM25 remains default scoring algorithm in all modern Elasticsearch versions (8.x). Configuration: Customize per-field with {"similarity": {"type": "BM25", "k1": 1.5, "b": 0.8}} in index mapping. Best practice: Use default parameters (k1=1.2, b=0.75) unless specific ranking issues identified through A/B testing.
Default values: k1 = 1.2 (term frequency saturation parameter), b = 0.75 (document length normalization factor). These defaults from academic research work well for 90%+ of use cases. How they work: (1) k1 controls term frequency saturation curve. Higher k1 (e.g., 2.0) gives more weight to repeated terms - good for technical docs where repetition signals relevance. Lower k1 (e.g., 0.8) reduces impact of term repetition - good for marketing content with keyword stuffing. Range: 1.2-2.0 typical, rarely goes below 1.0. (2) b controls document length normalization. b=1.0 fully normalizes by length (penalizes long docs heavily). b=0.0 disables normalization (favors long comprehensive docs). b=0.75 balances both. Tuning guidance (2025): Start with defaults (k1=1.2, b=0.75). If short docs rank too low, decrease b to 0.5-0.6. If keyword-stuffed docs rank too high, increase k1 to 1.5-2.0. Configuration: Set per-field in mapping with "similarity": {"my_custom_bm25": {"type": "BM25", "k1": 1.5, "b": 0.8}}. Real-world impact: Adjusting k1 from 1.2 to 1.8 in technical documentation improved precision@10 by 12% in A/B tests. Most users should stick with defaults unless specific ranking issues identified through user testing.
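The effect of k1 on saturation is easy to see numerically. A sketch of just the TF factor of BM25, with dl = avgdl so the length term drops out:

```python
def tf_component(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    """BM25 term-frequency factor; with dl == avgdl the length term is 1."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# A single occurrence always contributes 1.0; the 10th occurrence gains more
# under a higher k1 (less saturation) than a lower one (faster saturation).
low_k1 = tf_component(10, k1=0.8)   # ≈ 1.67
high_k1 = tf_component(10, k1=2.0)  # = 2.5
```

This is why raising k1 helps technical documentation (repetition signals relevance) while lowering it blunts keyword stuffing.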
Algolia uses Damerau-Levenshtein distance algorithm for fuzzy matching, calculating edit distance between query and indexed terms. Supported operations: insertion, deletion, substitution, transposition (swapping adjacent characters). Typo handling rules: (1) Words with 1-3 characters: no typos allowed (too short for reliable fuzzy matching). (2) Words with 4-7 characters: 1 typo allowed. (3) Words with 8+ characters: 2 typos allowed. (4) Exception: 3 typos allowed if first typo is on initial letter (accounts for common typing mistakes). Configuration parameters: minWordSizefor1Typo (default: 4), minWordSizefor2Typos (default: 8). Ranking impact: Typo count is PRIMARY ranking criterion before all other signals. Ranking order: exact match (0 typos) > 1 typo > 2 typos. Within same typo count, other ranking criteria apply (custom ranking, text relevance, geo distance). Advanced control: Set typoTolerance per query: "true" (default), "false" (strict matching), "min" (keep only hits with the minimum typo count), "strict" (like min, but tie-broken more strictly). Use disableTypoToleranceOnWords: ["brand", "iphone"] to require exact matches for specific terms. Performance: Typo tolerance adds minimal overhead (<2ms) due to optimized prefix trees. User impact: Improves conversion rates 15-30% for e-commerce search by handling "ipone" → "iphone", "lapto" → "laptop". Best practice (2025): Enable typo tolerance globally, disable for brand names and SKUs using disableTypoToleranceOnWords or disableTypoToleranceOnAttributes.
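A sketch of the distance computation itself, using the optimal-string-alignment variant of Damerau-Levenshtein in which each insertion, deletion, substitution, or adjacent transposition counts as one typo (Algolia's production implementation is proprietary and heavily optimized; this only illustrates the counting):

```python
def osa_distance(a, b):
    """Edit distance with insert/delete/substitute/adjacent-transpose, cost 1 each."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Under this counting, "ipone" → "iphone" is 1 typo (insertion) and the transposed "ihpone" → "iphone" is also just 1 typo, which is why Damerau-Levenshtein suits keyboard mistakes better than plain Levenshtein.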
An Elasticsearch segment is a self-contained Apache Lucene mini-index - a complete, immutable snapshot of documents written at a specific point in time. Architecture: Each shard contains multiple segments. Each segment contains four core components: (1) inverted index (term → [docID, frequency, positions] mappings enabling fast term lookups), (2) stored fields (original JSON documents), (3) doc values (column-oriented data for sorting/aggregations), (4) norms (field length metadata for BM25 scoring). Segments are immutable by design - once written to disk via fsync, they never change. Why immutable (write-once design)? Benefits are significant: (1) Concurrent lock-free reads - multiple queries scan identical segment simultaneously with zero contention, enabling 1000+ QPS per shard without synchronization overhead. (2) OS filesystem caching - immutable files persist in OS page cache indefinitely without cache invalidation. Result: 30-50% query speedup and faster aggregations via disk I/O reduction. (3) Aggressive compression - write-once guarantee enables delta-encoding, variable-byte encoding, and frame-of-reference compression (40-60% smaller than mutable structures). (4) Simplified data structures - no concurrent write handling complexity, no versioning overhead. Deletion handling: Documents never deleted in-place. Instead: (1) Updates/deletes mark document as "deleted" in lightweight .liv (live docs) bitmap file, (2) deleted documents excluded from search results but still occupy disk space, (3) actual deletion happens only during segment merging when deleted docs physically removed. Segment merging: Background TieredMergePolicy combines small segments into larger ones (targeting segments_per_tier, default 10), removing deleted documents, reclaiming disk space. Merge cost: I/O intensive but throttled (starting around 20MB/s) to avoid impacting query performance.
Real-world impact: This write-once, immutable-by-design architecture is fundamental to Elasticsearch's ability to handle billions of time-series documents efficiently - new data writes append-only to new segments while billions of historical documents remain in immutable segments optimized for reading.
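The write-once lifecycle above — immutable segments, soft deletes in a live-docs bitmap, physical reclamation at merge time — can be modeled with a toy sketch (class names invented for illustration; real Lucene segments are far more elaborate):

```python
# Hypothetical model of the write-once segment design: documents live in
# immutable segments, deletes only flip bits in a per-segment live-docs
# bitmap (the .liv side file), and a merge physically drops dead docs.
class Segment:
    def __init__(self, docs):
        self.docs = tuple(docs)          # immutable once "written"
        self.live = [True] * len(docs)   # .liv-style bitmap, mutable side file

    def mark_deleted(self, doc_id):
        self.live[doc_id] = False        # soft delete: no segment bytes change

    def search(self, term):
        return [d for d, alive in zip(self.docs, self.live)
                if alive and term in d]

def merge(segments):
    """Combine segments, physically expunging soft-deleted documents."""
    survivors = [d for s in segments
                 for d, alive in zip(s.docs, s.live) if alive]
    return Segment(survivors)

s1 = Segment(["error: disk full", "info: started"])
s2 = Segment(["error: timeout"])
s1.mark_deleted(1)            # still occupies space, excluded from search
merged = merge([s1, s2])
print(len(merged.docs))       # 2 -- the deleted doc is reclaimed at merge time
```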
Replica shards improve query performance through intelligent distribution and load balancing. Replicas enable horizontal read scaling because both primary and replica shards handle search requests identically, while only primary shards accept writes. Throughput improvement: With N replicas, the cluster can process approximately (N+1)× the concurrent query volume for the same data. Example: 1 primary + 2 replicas = 3x query throughput vs 0 replicas. Query routing mechanism: Elasticsearch uses Adaptive Replica Selection (ARS), enabled by default in version 7+, which routes each search request to the best-performing shard copy using EWMA (exponentially weighted moving average) metrics: (1) Service time EWMA (how long prior searches took on each node), (2) Response time EWMA (network latency from the coordinating node), (3) Search queue depth EWMA (number of queued search requests). This avoids routing queries to degraded nodes experiencing GC pauses, disk I/O contention, or network saturation. Benchmark results: Under load, ARS improved throughput 113% (41→88 queries/sec) and reduced p90 latency 64.7% (5,215ms→1,839ms) and p99 latency 60.6% (6,181ms→2,434ms). Critical distinction: Replicas improve READ throughput only. Writes (index, update, delete) must replicate to all in-sync replicas before being acknowledged, so more replicas INCREASE write latency—replicas don't improve write performance. Single-query latency unchanged: Adding replicas doesn't reduce the latency of an individual query; multiple concurrent queries benefit when distributed across replicas. Geographic distribution benefit: Placing replicas in different availability zones reduces network latency to geographically distributed users. Trade-offs: Storage cost (2 replicas = 3x disk space), write overhead (longer replication latency), recovery time (longer cluster rebalancing). Configuration: PUT /my-index/_settings {"number_of_replicas": 2}.
Best practice (2025): 1-2 replicas for production (1 for cost-sensitive, 2 for mission-critical). Requires adequate cluster nodes—if each shard copy isn't on separate node, benefits disappear.
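The routing idea behind ARS can be sketched in a few lines: track an EWMA of observed latency per shard copy and send each query to the lowest-scoring copy. This is a simplification — the real ARS rank combines service time, response time, and queue depth, and the alpha value here is an assumption:

```python
# Simplified EWMA-based replica selection sketch (not the real ARS formula).
def ewma(prev: float, sample: float, alpha: float = 0.3) -> float:
    """Exponentially weighted moving average: recent samples weigh more."""
    return alpha * sample + (1 - alpha) * prev

class ShardCopy:
    def __init__(self, node: str):
        self.node = node
        self.response_ewma = 0.0

    def observe(self, latency_ms: float):
        # Seed with the first sample, then smooth subsequent observations.
        self.response_ewma = ewma(self.response_ewma or latency_ms, latency_ms)

def pick_copy(copies):
    """Route to the copy with the best (lowest) smoothed latency."""
    return min(copies, key=lambda c: c.response_ewma)

primary, replica = ShardCopy("node-1"), ShardCopy("node-2")
for ms in (12, 14, 250):   # node-1 hits a GC pause on the last search
    primary.observe(ms)
for ms in (20, 19, 21):    # node-2 stays steady
    replica.observe(ms)
print(pick_copy([primary, replica]).node)  # node-2
```

The EWMA is what lets routing react quickly to a degraded node while ignoring one-off noise.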
In September 2024, AWS transferred OpenSearch governance from AWS-controlled project to Linux Foundation, establishing the OpenSearch Foundation as independent steward. Significance: (1) Vendor neutrality - governance no longer controlled by single company (AWS). Decision-making through community consensus vs corporate directive. (2) Multi-vendor participation - enables IBM, SAP, Canonical, and other contributors to have equal standing alongside AWS. (3) Intellectual property protection - Linux Foundation provides neutral legal home for project IP. (4) Long-term sustainability - reduces risk of project abandonment if AWS priorities shift. (5) Enterprise trust - organizations wary of AWS lock-in now see OpenSearch as truly vendor-neutral like Linux, Kubernetes. Governance structure: Technical Steering Committee with representatives from multiple organizations, transparent RFC process for major changes, Apache 2.0 license unchanged (stays open source). Context: This move mirrors Elasticsearch's September 2024 AGPLv3 license addition - both projects addressing community trust concerns in 2024. Impact (2025): OpenSearch adoption increased 40%+ in Q4 2024 following announcement, particularly among enterprises running multi-cloud strategies. Critical for organizations evaluating OpenSearch vs Elasticsearch: OpenSearch now has truly neutral governance comparable to CNCF projects, while Elasticsearch remains Elastic-controlled (though with open-source AGPLv3 option).
Algolia's primary performance claims: (1) 1-20ms average query latency for most searches (targeting <50ms end-to-end for as-you-type search). (2) 12-200x faster than Elasticsearch depending on query complexity and Elasticsearch configuration. (3) <1ms response on simple single-character queries (e.g., "e") vs 202ms for single-shard Elasticsearch. (4) Consistent sub-5ms response times across typical queries (geo, batman, emilia) when benchmarked against unoptimized Elasticsearch. Architectural advantages: (1) Hardware: Bare metal with high-frequency 3.5-3.9 GHz Intel Xeon processors, indices entirely in RAM (256GB+ per shard), no Java GC pauses (uses C++ implementation). (2) Geographic: Distributed Search Network with 15+ regions providing 1-2ms latency reduction per 124 miles of distance, automatic replication across regions. (3) Pre-computation: Results pre-sorted at indexing time (not during query), eliminating sorting overhead. (4) Abstraction: Automatic sharding/rebalancing backend (not exposed in API), simplifying operations vs manual Elasticsearch shard management. Scale metrics (2025): 30+ billion records indexed, 1.7+ trillion searches annually. Reality check: "200x faster" claims use default Elasticsearch vs optimized Algolia on simple queries - misleading comparison. Well-tuned Elasticsearch with proper sharding, regional distribution, and caching achieves 20-50ms P95 latency, much closer to Algolia's claims. Trade-offs: Algolia wins on out-of-box performance and managed simplicity. Elasticsearch wins on flexibility (complex aggregations, joins, custom scoring), cost at 100M+ scale (self-hosted), and control. Critical context (2025): Algolia's speed claims are verifiable but optimized for e-commerce/catalog search use cases. Elasticsearch requires 3-6 months expertise investment to match similar latency but offers vastly more analytical power (APM, logs, SIEM). 
Choose based on use case: Algolia for search-first applications, Elasticsearch for analytics/observability or high-volume cost-sensitive deployments.
buckets_path specifies which aggregation outputs a pipeline aggregation uses as input by referencing metrics from sibling or parent aggregations. Formal syntax uses > as aggregation separator, . as metric separator, and [KEY_NAME] for multi-bucket key selection. Syntax: "buckets_path": "path>to>metric" navigates through nested aggregations to target specific metrics. Path formats: (1) Single metric: "sales>total" references total metric in sales aggregation. (2) Multi-value metrics: "stats_agg.avg" uses . separator to reference avg metric from multi-value stats aggregation. (3) Nested aggregations: "date_hist>category_terms>revenue.sum" navigates multiple aggregation levels using > separator. (4) Multi-bucket key selection: "sale_type['hat']>sales" selects specific 'hat' bucket key from multi-bucket sale_type aggregation - enables calculating metrics for individual bucket values. (5) Multiple paths (object notation): {"revenue": "sales>total", "costs": "expenses>total"} creates named variables accessible as params.revenue and params.costs in bucket_script. (6) Special keywords: "_count" (document count in bucket), "_key" (bucket key value, useful for time-series timestamp), "_value" (metric value from sibling aggregation). Comprehensive example with multi-bucket selection: {"bucket_script": {"buckets_path": {"hat_sales": "product_type['hat']>sales", "bag_sales": "product_type['bag']>sales", "shoe_sales": "product_type['shoe']>sales"}, "script": "params.hat_sales + params.bag_sales + params.shoe_sales"}}. Parent pipeline aggregations (derivative, moving_avg, cumulative_sum): use relative paths like {"buckets_path": "revenue"} referencing sibling metrics within same parent aggregation. Sibling pipeline aggregations (avg_bucket, max_bucket, sum_bucket): use full paths like {"buckets_path": "parent_agg>metric"} referencing separate sibling aggregations. Error handling: non-existent paths cause "No aggregation found" errors. 
Use gap_policy parameter to handle missing bucket values: "skip" (default, omit bucket), "insert_zeros" (treat as 0). Critical constraint: buckets_path references ONLY aggregation outputs, never document fields directly.
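The path syntax can be made concrete with a toy resolver that walks an aggregation-response-shaped dict — '>' descends into sub-aggregations, '.' selects a metric inside a multi-value aggregation (illustration only, not Elasticsearch's parser; bracketed key selection like sale_type['hat'] is omitted for brevity):

```python
# Toy buckets_path resolver over a response-shaped dict.
def resolve(agg: dict, path: str):
    for step in path.split(">"):
        if "." in step:                      # multi-value metric: "stats.avg"
            name, metric = step.split(".", 1)
            return agg[name][metric]
        agg = agg[step]                      # descend into sub-aggregation
    return agg["value"]                      # single-value metric

response_aggs = {
    "sales": {"total": {"value": 1500.0}},
    "price_stats": {"stats": {"avg": 42.5, "max": 99.0}},
}
print(resolve(response_aggs, "sales>total"))            # 1500.0
print(resolve(response_aggs, "price_stats>stats.avg"))  # 42.5
```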
When forking Elasticsearch 7.10.2 (last Apache 2.0 version) in April 2021, Amazon removed all code incompatible with Apache 2.0 license to create OpenSearch. Removed components: (1) Entire X-Pack codebase - Elastic's commercial/proprietary features including: security (SSO, RBAC, field-level security), machine learning (anomaly detection, data frame analytics), monitoring (cluster health dashboards), alerting (commercial version), SQL query interface (commercial features), graph analytics, reporting (PDF/PNG generation). (2) Elastic-branded telemetry - phone-home metrics collection sending usage data to Elastic servers. (3) Elastic trademarks - all logos, branding, references to "Elastic" company. What Amazon replaced: (1) Security: OpenSearch Security plugin (based on Open Distro for Elasticsearch's security module, originally developed by Amazon). (2) Alerting: OpenSearch Alerting (open-source alternative). (3) Machine Learning: OpenSearch ML Commons (different architecture, k-NN focus). (4) Dashboards: OpenSearch Dashboards (fork of Kibana 7.10.2, similarly de-X-Packed). (5) SQL: OpenSearch SQL (open-source query interface). License cleanup impact: OpenSearch codebase is 100% Apache 2.0 with no proprietary elements. Clean IP allows community contributions without license concerns. Feature parity (2025): OpenSearch has rebuilt most X-Pack functionality under Apache 2.0, though some advanced ML features still lag behind Elastic's commercial offerings. Critical context: This removal was necessary because Elastic's 2021 license change (Apache 2.0 → SSPL/ELv2) made X-Pack code unusable in Apache 2.0 fork.
Match query executes slower than term query because it applies text analysis to the query string before searching the inverted index, while term query performs direct literal lookups without any preprocessing. The analysis phase adds measurable overhead: (1) Tokenization - breaking the query into individual tokens based on tokenizer rules (the standard tokenizer splits on whitespace and punctuation), (2) Lowercasing - normalizing case ("GET" → "get", "Running" → "running"), (3) Token filtering - removing stopwords ("the", "is", "and"), applying stemming rules ("running" → "run", "foxes" → "fox"), expanding synonyms ("laptop" → ["laptop", "notebook", "computer"]). This multi-step analysis adds 1-5ms latency per query. Term query skips all analysis: it uses the query string exactly as provided, performs a single direct index lookup (typically <1ms), and returns a binary match without relevance scoring. Performance measurements: Benchmark tests show term query at ~2-3ms average latency while match query averages 3-5ms for typical single-field queries. The gap widens with complex analyzers (custom tokenizers, multiple filters, and synonym expansion can add 2-10ms). However, absolute differences matter less than throughput: from a single-query perspective the difference is milliseconds; at high volume (10,000 queries/sec) it multiplies across the entire workload. When to use each: Match query for full-text search on text fields (product descriptions, article content, user input where analysis improves matching). Match analyzes user input and matches it against similarly analyzed indexed terms - the query "Running shoes" is analyzed to ["run", "shoe"], which match the indexed tokens. Term query for exact matching on keyword fields (UUIDs, status values like "ACTIVE", category IDs, email addresses). Term searches unanalyzed keyword fields where analysis would break matching (the query "[email protected]" must match exactly, not tokenized).
Critical mistake: Using term query on text fields causes mismatches because text fields store analyzed tokens while term query searches for literal strings. Query for "Running shoes" on text field searches for exact phrase token, finds nothing because indexed text contains ["run", "shoe"] tokens separately. Best practice (2025): Always match query to field type - text fields require match (or multi_match, match_phrase), keyword fields use term (or terms). For performance-critical exact matching, map as keyword type enabling efficient term queries. Use filter context instead of query context on term queries to leverage segment-level caching (2-10x faster than query context, skips scoring).
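A toy analyzer makes the mismatch concrete: the text field stores analyzed tokens, but a term query looks up the raw string (the stopword list and stemming rule here are crude stand-ins for Lucene's analyzers):

```python
# Toy analysis pipeline: tokenize -> lowercase -> stopword filter -> stem.
STOPWORDS = {"the", "is", "and"}

def analyze(text: str) -> list[str]:
    tokens = [t.lower() for t in text.split()]           # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword filter
    return [t[:-4] if t.endswith("ning") else            # crude stemming:
            t[:-1] if t.endswith("s") else t             # "running"->"run",
            for t in tokens]                             # "shoes"->"shoe"

indexed = analyze("Running shoes")   # what the text field actually stores
print(indexed)                       # ['run', 'shoe']

query_term = "Running shoes"         # what a term query looks up literally
print(query_term in indexed)         # False -> term query on text field misses
print(analyze(query_term))           # ['run', 'shoe'] -> match query hits
```

The same input succeeds via match only because the query passes through the same analysis as the indexed text.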
Default values: minWordSizefor1Typo = 4 characters, minWordSizefor2Typos = 8 characters. These control typo tolerance thresholds per word length. How they work: (1) Words with 1-3 characters: No typos allowed (exact match required). Rationale: Short words like "cat", "dog", "car" have too few characters for reliable fuzzy matching - 1 typo would change word entirely. (2) Words with 4-7 characters: 1 typo allowed (uses minWordSizefor1Typo = 4 threshold). Example: "ipone" matches "iphone" (1 character insertion). (3) Words with 8+ characters: 2 typos allowed (uses minWordSizefor2Typos = 8 threshold). Example: "elastcserch" matches "elasticsearch" (1 deletion + 1 transposition). Customization use cases: Strict matching for brands/SKUs: Increase minWordSizefor1Typo to 6 or disable typos entirely for specific attributes using disableTypoToleranceOnAttributes. More forgiving for international names: Decrease minWordSizefor1Typo to 3 if users frequently misspell short foreign words. Technical products: Decrease minWordSizefor2Typos to 6 if product names like "router" (6 chars) frequently have 2-typo queries like "routter". Configuration: Set in index settings: {"minWordSizefor1Typo": 5, "minWordSizefor2Typos": 10} or per-query: searchParameters: {"typoTolerance": "min", "minWordSizefor1Typo": 5}. Real-world impact: Default values (4, 8) optimized for e-commerce and content search based on Algolia's analysis of billions of queries. Most use cases should use defaults. Only adjust if seeing too many false positives (lower threshold) or too many missed matches (higher threshold). Best practice (2025): Start with defaults, adjust only after analyzing search logs showing specific typo patterns in your domain.
BM25 score formula: score = IDF(q) * (TF(q,d) * (k1 + 1)) / (TF(q,d) + k1 * (1 - b + b * (|d| / avgdl))). Components breakdown: (1) IDF(q) - Inverse Document Frequency: Measures term rarity across the corpus. Formula: log(1 + (N - df + 0.5) / (df + 0.5)) where N = total docs, df = docs containing the term. Rare terms get higher IDF scores (more valuable for relevance). (2) TF(q,d) - Term Frequency: How many times query term q appears in document d. Raw count, not normalized. More occurrences = higher relevance (with saturation). (3) k1 parameter (default: 1.2): Controls term frequency saturation. Higher k1 = less saturation, repeated terms matter more. Lower k1 = faster saturation, diminishing returns for repetition. (4) b parameter (default: 0.75): Controls document length normalization. b=1 fully normalizes (penalizes long docs), b=0 disables normalization (favors long docs). (5) |d| - Document length (word count in doc). (6) avgdl - Average document length across the corpus. (7) boost - Optional query-level multiplier (default: 1.0). Simplified intuitive formula: score ≈ boost * IDF * (TF with saturation and length normalization). Example calculation: Query "docker" in a document containing "docker" 3 times, 100 words long, avgdl=150, k1=1.2, b=0.75, IDF=2.5. TF component = (3 * 2.2) / (3 + 1.2 * (1 - 0.75 + 0.75 * (100/150))) = 6.6 / 3.9 ≈ 1.69. Final score ≈ 2.5 * 1.69 ≈ 4.2. Why BM25 over TF-IDF: Better handling of term frequency saturation (prevents keyword stuffing from inflating scores) and improved document length normalization. Elasticsearch adoption: Default since version 5.0 (2016), replacing the classic TF-IDF similarity.
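The formula and worked example can be checked directly with a short sketch (defaults mirror Elasticsearch's k1=1.2, b=0.75):

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """BM25 inverse document frequency: rarer terms score higher."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25(tf: int, doc_len: int, avgdl: float, idf_value: float,
         k1: float = 1.2, b: float = 0.75, boost: float = 1.0) -> float:
    """BM25 per-term score with saturation (k1) and length norm (b)."""
    norm = 1 - b + b * (doc_len / avgdl)
    return boost * idf_value * (tf * (k1 + 1)) / (tf + k1 * norm)

# Worked example from the text: "docker" appears 3 times in a 100-word
# document, avgdl=150, IDF=2.5.
score = bm25(tf=3, doc_len=100, avgdl=150, idf_value=2.5)
print(round(score, 2))  # 4.23

# Saturation in action: doubling tf from 3 to 6 adds far less than 2x.
print(round(bm25(tf=6, doc_len=100, avgdl=150, idf_value=2.5), 2))
```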
Recommendation (2025): Use search-time synonyms for maximum flexibility and maintainability - advantages significantly outweigh minimal performance cost. Search-time is the clear best practice, with index-time justified only in rare scenarios. SEARCH-TIME SYNONYMS (RECOMMENDED): Advantages: (1) Update without reindexing - Set "updateable": true and modify synonyms via Synonyms Management API (Elasticsearch 8.10+) without reindexing any data (hours of reindex work becomes <1 second). (2) Smaller index size - Synonyms stored separately from inverted index, not duplicated across documents, reducing disk footprint 10-30% on synonym-heavy domains. (3) Fast iteration - Test changes instantly, A/B test variants, respond rapidly to trending terms (new product names, brand variations, emerging slang). (4) Centralized management - Single synonym set referenced by multiple analyzers ensures consistent behavior; changes propagate automatically. (5) Enables synonym_graph filter - Correctly handles multi-word synonyms ("credit card" ↔ "cc") using token graphs; legacy synonym filter breaks phrase matching. Disadvantages: ~1-3ms additional query latency per search for real-time synonym expansion (negligible for most applications). INDEX-TIME SYNONYMS: Advantages: (1) Marginally faster queries - Synonyms pre-expanded during indexing provides ~2% improvement (only measurable at sub-10ms latency budgets). (2) Works reliably with match_phrase queries - Phrase position matching unaffected. Disadvantages: (1) Requires full reindex to update synonyms - Every synonym change forces reindexing all documents (hours-days for large indexes, operational complexity). (2) Larger index size - All synonym expansions stored in inverted index, inflating disk 20-50%. (3) Statistics skewing - Expanded terms have artificially high frequency, degrading BM25 relevance scoring. (4) Difficult A/B testing - Can't test synonym changes without expensive full reindex. 
MODERN BEST PRACTICE (Elasticsearch 8.10+): Use the Synonyms Management API with search-time synonym_graph. Create: PUT /_synonyms/product-synonyms {"synonyms_set": [{"synonyms": "laptop, notebook, computer, chromebook"}, {"synonyms": "quick, fast, rapid, speedy"}]}. Reference in the search analyzer: {"filter": [{"type": "synonym_graph", "synonyms_set": "product-synonyms", "updateable": true}]}. Update on-the-fly: PUT /_synonyms/product-synonyms/laptop {"synonyms": "laptop, notebook, computer, chromebook"} - search analyzers referencing the set reload automatically with zero downtime (POST /my-index/_reload_search_analyzers is needed only for file-based synonyms). HYBRID APPROACH (RARE): Index-time only for ultra-stable domain-specific taxonomies (legal terms, regulatory categories) that change <1x per year AND have an absolute latency budget <5ms; otherwise use search-time. Critical: Use the synonym_graph filter (not the legacy synonym filter) - only synonym_graph correctly handles multi-word synonyms without breaking phrase queries. For 99%+ of production use cases, search-time with the Synonyms API is optimal.
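The core argument for search-time synonyms — expansion happens at query time against an unchanged index — can be shown with a toy model (not the actual synonym_graph filter; the synonym set and document are invented):

```python
# Toy search-time synonym expansion: the index stores only original tokens;
# the query is expanded from a synonym set that can be swapped at any time
# without touching the index.
SYNONYMS = {"laptop": {"laptop", "notebook", "computer", "chromebook"}}

def expand_query(tokens, synonyms):
    expanded = set()
    for t in tokens:
        expanded |= synonyms.get(t, {t})   # expand, or keep token as-is
    return expanded

index = {"doc1": {"notebook", "16gb", "ram"}}   # indexed exactly as written
hits = [doc for doc, terms in index.items()
        if expand_query({"laptop"}, SYNONYMS) & terms]
print(hits)  # ['doc1'] -- synonym matched without reindexing doc1
```

Updating SYNONYMS here changes results instantly, which is the maintainability win over baking expansions into the index.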
Elasticsearch segment merging is an automatic background process that combines multiple small immutable Lucene segments into fewer larger segments within a shard, balancing search performance with indexing overhead. Purpose: (1) Reclaim disk space by physically expunging soft-deleted documents (marked in .liv bitmap files, not actually deleted), (2) Improve query performance by reducing segment count (searching 10 segments is faster than searching 100), (3) Lower memory pressure from excessive per-segment state overhead. How it works: Elasticsearch monitors each shard's segment sizes and count. TieredMergePolicy (Lucene's default merge policy) calculates a logarithmic staircase budget based on index size and parameters. When the actual segment count exceeds the budget, the merge scheduler selects candidates: it reads the source segments, combines their inverted indexes, skips deleted documents, and writes a single merged segment to disk. Once the merged segment is fsync'd, an atomic switch replaces the old segments (which are deleted after all in-flight queries finish). Key parameters (configurable): index.merge.policy.max_merged_segment (default 5GB - segments larger than this are excluded from merging), index.merge.policy.segments_per_tier (default 10 - target segment count per tier), index.merge.policy.floor_segment (default 2MB - minimum segment size), index.merge.policy.max_merge_at_once (default 10 - max segments merged in a single operation), index.merge.policy.expunge_deletes_allowed (default 10% - deleted-doc threshold to trigger delete expunging). Throttling mechanism: Auto-throttling limits merge I/O to prevent query performance degradation, starting around 20MB/s and adapting based on whether merges keep up with indexing; when merges fall behind, Elasticsearch throttles indexing so they can catch up. Control threads: index.merge.scheduler.max_thread_count (default Math.max(1, Math.min(4, cores / 2))) - higher allows more parallel merges. Performance impact: Merging is write-amplification intensive (bytes processed vs final index size).
Typical merge storms occur under heavy indexing (10M+ docs/hour), which creates many small segments and causes sustained high I/O and CPU. Tradeoff: a higher refresh_interval (default 1s) produces fewer, larger segments and reduces merge pressure, at the cost of documents taking longer to become searchable. Force merge: POST /my-index/_forcemerge?max_num_segments=1 produces a single segment, optimizing searches but hindering future merges (segments larger than 5GB are excluded from automatic merging). Critical warning: Force-merging actively written indexes causes problems - the oversized segments cannot participate in automatic merges, so deletions accumulate and performance degrades. Production best practice (2025): Let the automatic TieredMergePolicy handle routine merging. Use _forcemerge only on read-only indices (completed time-series data) for search and snapshot optimization. Monitor merge activity with GET /my-index/_stats/merge or GET /_nodes/stats/indices. For write-heavy workloads, consider staggered shard allocation across data tiers.
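The effect of tiered merging on segment count can be illustrated with a coarse simulation — real TieredMergePolicy scoring (skew, deletes, size budget) is far more subtle; this only shows small segments collapsing until a per-tier budget is met:

```python
# Coarse illustration: repeatedly merge the smallest segments (up to
# max_merge_at_once at a time) while the count exceeds the tier budget.
def merge_pass(segment_sizes_mb, segments_per_tier=10, max_merge_at_once=10):
    sizes = sorted(segment_sizes_mb)
    while len(sizes) > segments_per_tier:
        batch, sizes = sizes[:max_merge_at_once], sizes[max_merge_at_once:]
        sizes.append(sum(batch))   # one merged segment replaces many small ones
        sizes.sort()
    return sizes

segments = [2] * 40                # forty tiny 2MB flush segments
after = merge_pass(segments)
print(len(after), sum(after))      # far fewer segments, same total bytes
```

Note the write amplification: every merged byte is read and rewritten, which is why sustained flushing of tiny segments drives heavy background I/O.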
This statement is incorrect - Elasticsearch has the significantly steeper learning curve, not Algolia. Algolia advantages: User-friendly interface, intuitive APIs, comprehensive SDKs for major languages, polished developer experience, reduced maintenance needs, quicker implementation (minutes vs days). Setup is straightforward with minimal configuration. Learning curve only steepens when deep customization needed (advanced ranking algorithms, tie-breaking system). Elasticsearch challenges (2025): Notoriously steep learning curve requiring extensive technical expertise. Setup complexity: Must configure Java runtime, network ports, security settings, node roles, shard allocation, index mappings, replica strategies. Production-ready cluster requires understanding cluster health monitoring, split-brain prevention, master node election. Ranking complexity: Getting results properly ranked requires understanding BM25 parameters, custom similarity functions, function_score queries, boosting strategies. DevOps overhead: Self-hosted Elasticsearch requires ongoing management - capacity planning, backup strategies, version upgrades, security patching, performance tuning. Concepts to master: Inverted indexes, segment merging, refresh intervals, translog, circuit breakers, fielddata cache. Time investment: Elasticsearch expertise typically requires 3-6 months learning curve vs Algolia's days/weeks. Best practice (2025): Choose Algolia for rapid deployment with limited DevOps resources. Choose Elasticsearch when team has dedicated DevOps capacity and needs deep analytical capabilities beyond search.
Parent pipeline aggregation that executes custom scripts to perform per-bucket calculations on metrics from parent multi-bucket aggregations. Enables complex math operations combining multiple aggregation outputs within each bucket. Requirements: Input metrics must be numeric, script must return numeric value. Use cases: Calculate ratios (conversion rate = conversions / visits), profit margins (revenue - cost), percentages, ROI, growth rates. Configuration syntax: Uses buckets_path parameter where key is variable name for script, value is path to metric. Format: "aggregation_name>metric_name" for single path or object notation for multiple paths. Example calculating profit per month: {"bucket_script": {"buckets_path": {"revenue": "sales>total_revenue", "cost": "expenses>total_cost"}, "script": "params.revenue - params.cost"}}. Real-world example from Elasticsearch docs: Calculate t-shirt sales percentage of total monthly sales using date_histogram with filtered sub-aggregations, then bucket_script: {"buckets_path": {"tShirts": "t-shirts>sales", "total": "total>sales"}, "script": "params.tShirts / params.total * 100"}. Path navigation: Use '>' as aggregation separator, '.' as metric separator. Example: "my_bucket>my_stats.avg" references avg value in my_stats metric within my_bucket aggregation. Special keywords: "_count" (bucket document count), "_key" (bucket key value). Best practice (2025): Use bucket_script for business metrics calculations. Combine with date_histogram for time-series analytics, terms for category-based calculations.
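What bucket_script computes can be mimicked locally: for each bucket of the parent multi-bucket aggregation, bind the buckets_path variables, then evaluate a numeric expression (toy code over a response-shaped structure, not the Painless engine):

```python
# Toy bucket_script evaluation: per bucket, resolve each buckets_path
# variable to a metric value, then run the script over the params.
def bucket_script(buckets, buckets_path, script):
    results = []
    for b in buckets:
        params = {var: b[path]["value"]        # single-level paths only here
                  for var, path in buckets_path.items()}
        results.append(script(params))
    return results

monthly = [  # date_histogram-style buckets with metric sub-aggregations
    {"key": "2025-01", "sales": {"value": 1000.0}, "costs": {"value": 400.0}},
    {"key": "2025-02", "sales": {"value": 1200.0}, "costs": {"value": 500.0}},
]
profits = bucket_script(
    monthly,
    {"revenue": "sales", "cost": "costs"},        # buckets_path
    lambda p: p["revenue"] - p["cost"],           # "script"
)
print(profits)  # [600.0, 700.0]
```

This mirrors the profit-per-month example above: the script sees only the named params resolved from each bucket's metrics, never raw document fields.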