Elasticsearch Text Analysis FAQ & Answers

18 expert Elasticsearch Text Analysis answers researched from official documentation. Every answer cites authoritative sources you can verify.

General (18 questions)

Index-time synonyms are justified only in rare scenarios: (1) Synonyms absolutely never change (extremely rare - language evolves). (2) Query performance is critical and cannot tolerate the 1-2 ms search-time expansion overhead (sub-10 ms queries). (3) Disk space is cheaper than operational complexity. Drawbacks: you must reindex the entire dataset for any synonym update (hours for large indexes, production downtime), the index bloats (each synonym variation is stored in the inverted index), and debugging is harder. Reality: fewer than 5% of Elasticsearch deployments use index-time synonyms, since updateable search-time synonyms (8.10+) eliminated the primary justification.

Search-time implementation: set updateable: true and modify synonyms via the Synonyms Management API (Elasticsearch 8.10+) without reindexing. Example: PUT /_synonyms/my-synonyms {"synonyms_set": [{"synonyms": "laptop, notebook, computer"}]}. Apply the synonym filter to the search analyzer only (not the index analyzer): configure search_analyzer with a synonym or synonym_graph filter. To update synonyms, send PUT /_synonyms/my-synonyms with the new rule list; changes apply immediately to all searches. No downtime, no reindexing required.
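A minimal end-to-end sketch of that setup, assuming a hypothetical synonyms set my-synonyms and index products (any names work):

    PUT /_synonyms/my-synonyms
    {
      "synonyms_set": [
        { "synonyms": "laptop, notebook, computer" }
      ]
    }

    PUT /products
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_synonyms": {
              "type": "synonym_graph",
              "synonyms_set": "my-synonyms",
              "updateable": true
            }
          },
          "analyzer": {
            "my_search_analyzer": {
              "tokenizer": "standard",
              "filter": [ "lowercase", "my_synonyms" ]
            }
          }
        }
      }
    }

Repeating the PUT /_synonyms/my-synonyms call with a new rule list updates the set in place; search analyzers that reference it pick up the change without a reindex.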
The synonym_graph token filter correctly handles multi-word synonyms by creating a proper token graph with positionLength attributes. The standard synonym filter does not handle multi-word synonyms correctly - it ignores positionLength and produces invalid token graphs. For multi-word synonyms (e.g., 'ny => new york'), always use synonym_graph at search time. The synonym_graph filter was added in Lucene 6.4.0 (Elasticsearch 5.2.0) and replaced the deprecated SynonymFilter for this use case.
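As an illustration, a search analyzer using an equivalence rule shows the graph behavior (index name places and filter name city_synonyms are hypothetical):

    PUT /places
    {
      "settings": {
        "analysis": {
          "filter": {
            "city_synonyms": {
              "type": "synonym_graph",
              "synonyms": [ "ny, new york" ]
            }
          },
          "analyzer": {
            "city_search": {
              "tokenizer": "standard",
              "filter": [ "lowercase", "city_synonyms" ]
            }
          }
        }
      }
    }

    GET /places/_analyze
    {
      "analyzer": "city_search",
      "text": "ny pizza"
    }

The response should show 'ny' spanning two positions (positionLength 2) alongside the 'new' and 'york' tokens, which is exactly the graph structure the plain synonym filter cannot produce.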
An Elasticsearch analyzer contains three components applied in order: (1) Character filters - zero or more filters that transform the raw text stream by adding, removing, or changing characters before tokenization. (2) Tokenizer - exactly one required tokenizer that breaks the character stream into individual tokens (usually words) and records position and offsets. (3) Token filters - zero or more filters that modify the token stream (lowercase, stemming, synonyms, stop words). Built-in analyzers pre-package these components; custom analyzers let you combine them freely.
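A sketch of a custom analyzer combining all three stages (the names my-index, and_to_word, and my_custom are made up):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "and_to_word": { "type": "mapping", "mappings": [ "& => and" ] }
          },
          "analyzer": {
            "my_custom": {
              "type": "custom",
              "char_filter": [ "and_to_word" ],   // 1: runs on raw text before tokenization
              "tokenizer": "standard",            // 2: exactly one tokenizer
              "filter": [ "lowercase", "asciifolding" ]   // 3: run on the token stream
            }
          }
        }
      }
    }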
The default max_token_length for the Elasticsearch standard tokenizer is 255 characters. If a token exceeds this length, it is split at max_token_length intervals. This default also applies to the classic tokenizer and char_group tokenizer. You can configure a custom value in the analyzer settings: {"type": "standard", "max_token_length": 512}.
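Raising the limit means defining a custom tokenizer and referencing it from an analyzer; a sketch with illustrative names:

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "long_token_tokenizer": {
              "type": "standard",
              "max_token_length": 512
            }
          },
          "analyzer": {
            "long_token_analyzer": {
              "tokenizer": "long_token_tokenizer",
              "filter": [ "lowercase" ]
            }
          }
        }
      }
    }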
The Elasticsearch Synonyms API limits each synonyms set to a maximum of 10,000 synonym rules. If you need to manage more synonym rules, you must create multiple synonyms sets. This limit exists to prevent performance degradation during synonym expansion at search time.
Use the _reload_search_analyzers API: POST /{index}/_reload_search_analyzers. This reloads search analyzers with updateable: true to pick up changes to synonym files or synonym sets. The API reloads per node (not per shard), so the total shard count in the response may differ from the index's shard count. After reloading, clear the request cache (POST /{index}/_cache/clear?request=true) to ensure stale cached responses are not returned.
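Concretely, for a hypothetical index my-index, the two calls are:

    POST /my-index/_reload_search_analyzers
    POST /my-index/_cache/clear?request=true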
Elasticsearch provides three built-in character filters: (1) html_strip - strips HTML elements like <b> and decodes HTML entities like &amp; to &. Uses Lucene's HTMLStripCharFilter. Can be configured with escaped_tags to preserve specific HTML tags. (2) mapping - replaces specified strings with specified replacements using a mappings array. (3) pattern_replace - uses Java regular expressions to match and replace characters. Warning: a replacement that changes the text length will produce incorrect highlighting.
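As a quick illustration of the mapping character filter, _analyze accepts inline definitions, so it can be tested without creating an index (the emoticon mappings are just an example):

    GET /_analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        { "type": "mapping", "mappings": [ ":) => happy", ":( => sad" ] }
      ],
      "text": "the service was great :)"
    }

The final token should be 'happy', since the character filter rewrites the text before the tokenizer sees it.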
Explicit mapping uses => to define one-way expansion: 'laptop, notebook => computer' means searching 'laptop' or 'notebook' matches documents containing 'computer', but not vice versa. Implicit mapping (comma-separated without =>) creates bidirectional equivalence: 'laptop, notebook, computer' means all three terms match each other. Best practice: use explicit mappings for precision control. Implicit mappings can cause unexpected matches and degrade search precision when one term is much broader than others.
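Both rule styles can live in the same synonyms set; a sketch (the set name catalog-synonyms is hypothetical):

    PUT /_synonyms/catalog-synonyms
    {
      "synonyms_set": [
        { "synonyms": "laptop, notebook => computer" },   // explicit: one-way expansion
        { "synonyms": "tv, television" }                  // implicit: bidirectional equivalence
      ]
    }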
The standard analyzer is the default analyzer in Elasticsearch and includes: (1) No character filters. (2) Standard tokenizer - grammar-based tokenization using the Unicode Text Segmentation algorithm (Unicode Standard Annex #29), which works well for most languages. (3) Two token filters: the lowercase filter and the stop token filter (disabled by default; stopwords defaults to _none_). Configurable parameters: max_token_length (default 255) and stopwords (default _none_).
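The standard analyzer can be reconfigured under a new name; a sketch enabling English stopwords (the analyzer name english_standard is made up):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "english_standard": {
              "type": "standard",
              "max_token_length": 255,
              "stopwords": "_english_"
            }
          }
        }
      }
    }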
The Synonyms Management API was introduced in Elasticsearch 8.10 as a technical preview and became stable in Elasticsearch 8.13. It allows managing synonyms in an internal system index via REST API without file management. Key endpoints: PUT /_synonyms/{set_id} to create/update, GET /_synonyms/{set_id} to retrieve, DELETE /_synonyms/{set_id} to remove. When a synonyms set is updated, search analyzers using it are automatically reloaded on all indices.
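A sketch of the lifecycle, using a hypothetical set my-synonyms-set; rules can optionally carry an id so individual rules can be addressed later:

    PUT /_synonyms/my-synonyms-set
    {
      "synonyms_set": [
        { "id": "rule-1", "synonyms": "hello, hi" }
      ]
    }

    GET /_synonyms/my-synonyms-set

    DELETE /_synonyms/my-synonyms-set

Note that deleting a set still referenced by an index analyzer is rejected; remove it from those indices first.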
Elasticsearch provides two tokenizers for partial word matching: (1) ngram tokenizer - breaks text into words at specified characters, then generates sliding-window n-grams. Example: 'quick' with min_gram=2, max_gram=3 produces [qu, ui, ic, ck, qui, uic, ick]. Good for substring matching. (2) edge_ngram tokenizer - generates n-grams anchored to the start of each word. Example: 'quick' with min_gram=1, max_gram=5 produces [q, qu, qui, quic, quick]. Ideal for autocomplete/search-as-you-type functionality.
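A common autocomplete pattern indexes with edge_ngram but searches with standard, so the query text itself is not n-grammed (all names here are illustrative):

    PUT /autocomplete-demo
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "autocomplete_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 10,
              "token_chars": [ "letter", "digit" ]
            }
          },
          "analyzer": {
            "autocomplete_index": {
              "tokenizer": "autocomplete_tokenizer",
              "filter": [ "lowercase" ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "autocomplete_index",
            "search_analyzer": "standard"
          }
        }
      }
    }

Using standard at search time means typing 'qui' matches the indexed edge n-grams of 'quick' without exploding the query into its own n-grams.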
No. If updateable is set to true for a synonym or synonym_graph token filter, the corresponding analyzer can ONLY be used as a search_analyzer. It cannot be used for indexing. Attempting to use an updateable synonym filter in an index analyzer will result in an error. This restriction exists because search-time synonym expansion requires different token graph handling than index-time.
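Building on the earlier sketch: an analyzer containing an updateable filter may appear only in search_analyzer, never in analyzer (my_search_analyzer and products are the hypothetical names from that sketch):

    PUT /products/_mapping
    {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "standard",                  // index-time analyzer: updateable filters not allowed here
          "search_analyzer": "my_search_analyzer"  // updateable synonyms are legal only here
        }
      }
    }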
The keyword tokenizer is a no-op tokenizer that outputs the entire input string as a single token, unchanged. Use cases: (1) Exact match fields where you want the whole value searchable as one unit (email addresses, product codes, IDs). (2) Combined with token filters like lowercase for case-insensitive exact matching. (3) Structured data that should not be broken into words. For keyword fields that need no analysis at all, use the keyword field type instead of text with keyword tokenizer.
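A sketch of case-insensitive exact matching (the index codes and analyzer lowercase_exact are hypothetical):

    PUT /codes
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "lowercase_exact": {
              "tokenizer": "keyword",      // whole input becomes one token
              "filter": [ "lowercase" ]
            }
          }
        }
      }
    }

For new designs, a keyword field with a lowercase normalizer achieves the same effect more cheaply.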
Use the escaped_tags parameter to specify HTML tags that should NOT be stripped. Example configuration: {"type": "html_strip", "escaped_tags": ["b", "i", "em", "strong"]}. This preserves bold and italic tags while stripping all other HTML. The filter still decodes HTML entities like &amp; to & even when tags are preserved. Useful when you need to index HTML content but want to keep certain formatting tags for display.
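This is easy to verify with _analyze (the sample text is arbitrary):

    GET /_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [
        { "type": "html_strip", "escaped_tags": [ "b" ] }
      ],
      "text": "<p>Fish &amp; <b>Chips</b></p>"
    }

The output should be a single token along the lines of 'Fish & <b>Chips</b>': the <p> tags are stripped, the entity is decoded, and <b> survives.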
When Elasticsearch security features are enabled, you must have the 'manage' index privilege for the target data stream, index, or index alias to use the _reload_search_analyzers API. This is a relatively high privilege level - the same privilege required for operations like force merge, refresh, and close/open index.
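For example, a role granting just that privilege on a set of indices might look like this (the role and index names are made up):

    PUT /_security/role/analyzer_reloader
    {
      "indices": [
        {
          "names": [ "my-index*" ],
          "privileges": [ "manage" ]
        }
      ]
    }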
The uax_url_email tokenizer is like the standard tokenizer but recognizes URLs and email addresses as single tokens instead of breaking them apart. The standard tokenizer breaks 'user@example.com' into ['user', 'example.com'] and 'https://example.com/path' into multiple tokens. The uax_url_email tokenizer keeps these as single tokens: ['user@example.com'] and ['https://example.com/path']. Essential for indexing content where searching by complete URLs or email addresses is required.
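The difference is easy to see with _analyze (the sample text is arbitrary):

    GET /_analyze
    {
      "tokenizer": "uax_url_email",
      "text": "Contact user@example.com or visit https://example.com/path"
    }

Expected tokens: [Contact, user@example.com, or, visit, https://example.com/path]. Swapping in "tokenizer": "standard" splits the email and URL into pieces.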
Search-time synonym expansion adds approximately 1-2 milliseconds of query latency compared to index-time synonyms. This overhead is negligible for most applications. The tradeoff is worth it because search-time synonyms: (1) require no reindexing when synonyms change, (2) produce smaller indexes (no duplicate term storage), (3) allow updateable: true for zero-downtime synonym updates. Only consider index-time synonyms for extreme low-latency requirements (sub-10ms queries) where every millisecond matters.