Qdrant FAQ & Answers
50 expert Qdrant answers researched from official documentation. Every answer cites authoritative sources you can verify.
Set HNSW parameters with higher values for maximum recall: m=32-64 (graph connections) and ef_construct=200-512 (build-time neighbors). Example: qdrant_client.create_collection(collection_name='high_recall', vectors_config=models.VectorParams(size=1536, distance=Distance.COSINE), hnsw_config=models.HnswConfigDiff(m=64, ef_construct=512)). Combine with search-time hnsw_ef=128-256 for 99%+ recall. Trade-offs: higher m increases memory (64 connections vs default 16), higher ef_construct increases indexing time. For 99% recall: use m=32 minimum with ef_construct=200+ for balanced builds, or m=64 with ef_construct=512 for maximum precision. Documented case shows 99.95% recall achievable with proper tuning.
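A minimal end-to-end sketch of the settings above, assuming qdrant-client against a local instance; the collection name and the zero-filled query vector are placeholders (query_points is the Query API available in client/server v1.10+):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Build-time parameters: denser graph (m=64) and wider construction beam (ef_construct=512)
client.create_collection(
    collection_name="high_recall",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=64, ef_construct=512),
)

# Query-time parameter: hnsw_ef widens the search beam for higher recall
hits = client.query_points(
    collection_name="high_recall",
    query=[0.0] * 1536,  # placeholder query embedding
    limit=10,
    search_params=models.SearchParams(hnsw_ef=256),
).points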
Set a higher ef_construct during index creation to maintain recall when using lower query-time ef values (exposed as hnsw_ef in Qdrant's SearchParams). Configuration: models.HnswConfigDiff(ef_construct=256). The ef_construct (build-time) parameter determines graph quality - higher values create denser connections that preserve accuracy even with a lower query-time ef. Default ef_construct=100 causes recall drops when hnsw_ef<128. For production: use ef_construct=200-256 minimum when queries use hnsw_ef<128. The ef_construct value sets the upper bound on achievable recall regardless of the query-time ef. Example: a collection with ef_construct=256 maintains 95%+ recall at hnsw_ef=64, while ef_construct=100 drops to ~85% recall at the same query setting.
Use smaller batch sizes with parallel processing. Default REST API limit is 32MB; gRPC has configurable limits via grpc_options. Solution: batch into chunks of 1000-2000 points using qdrant_client.upload_points(collection_name='collection', points=[...], batch_size=1000, parallel=4). The upload_points method automatically handles batching and parallel uploads. For 10k vectors with 1536 dimensions: each vector ~6KB (1536 * 4 bytes), plus metadata. Batch of 1000 = ~6MB (safe margin). Alternative: use upload_collection for larger datasets with automatic intelligent batching. Avoid single upsert calls with >2000 high-dimensional vectors. Configure larger gRPC limits if needed: grpc_options={'grpc.max_send_message_length': -1, 'grpc.max_receive_message_length': -1}. gRPC compression can reduce payload but may slow performance with large strings - test before enabling in production.
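A sketch of the batched, parallel upload described above, assuming a recent qdrant-client; the embeddings and collection name are placeholders:

import uuid

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Hypothetical data: 10k pre-computed 1536-dim embeddings with small metadata
embeddings = [[0.0] * 1536 for _ in range(10_000)]

points = (
    models.PointStruct(id=str(uuid.uuid4()), vector=vec, payload={"source": "demo"})
    for vec in embeddings
)

# upload_points splits the stream into batches of 1000 (~6MB each for 1536-dim
# float32 vectors) and sends them on 4 parallel workers, staying well below the
# 32MB REST request limit.
client.upload_points(
    collection_name="collection",
    points=points,
    batch_size=1000,
    parallel=4,
)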
Configure compression in qdrant_client initialization using grpc_compression parameter: from grpc import Compression; client = QdrantClient(host='localhost', grpc_port=6334, prefer_grpc=True, grpc_compression=Compression.Gzip). Alternatively use grpc_options: QdrantClient(url='localhost:6334', grpc_options={'grpc.default_compression_algorithm': 'gzip', 'grpc.default_compression_level': 3}). Compression levels: 0=none, 1-3=low, 4-6=medium, 7-9=high. For batch upserts use level 3-4 (balanced compression/speed). Warning: benchmarks show gzip compression can slow queries with large payloads (strings) - disable for performance-critical reads. Best practice: enable compression for write-heavy workflows (batch ingestion) but disable for read-heavy production queries. Test throughput with compression on/off for your payload size. Alternative: use upload_points batch_size optimization instead of compression for better performance.
Configure WAL parameters during collection creation: qdrant_client.create_collection(collection_name='fast_ingest', vectors_config=VectorParams(size=384, distance=Distance.COSINE), wal_config=models.WalConfigDiff(wal_capacity_mb=1, wal_segments_ahead=0)). Setting wal_capacity_mb to minimum (1MB) and wal_segments_ahead=0 reduces WAL overhead. Production recommendation (May 2025): wal_capacity_mb=64 and wal_segments_ahead=1 for pre-allocated speed with acceptable recovery time. More effective: disable HNSW indexing during bulk ingestion with hnsw_config=models.HnswConfigDiff(m=0) and set indexing_threshold=0. After ingestion complete, update collection with positive indexing_threshold to build index. This prevents memory spikes and speeds ingestion 3-5x. WAL cannot be fully disabled (ensures durability), but minimal WAL + disabled indexing optimizes bulk uploads.
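A bulk-ingestion sketch following the pattern above; kwarg spellings follow these answers (in particular, update_collection takes optimizer_config), and the collection name and sizes are placeholders:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 1. Create the collection with HNSW disabled (m=0) and indexing deferred
client.create_collection(
    collection_name="fast_ingest",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=0),
    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=0),
    wal_config=models.WalConfigDiff(wal_capacity_mb=64, wal_segments_ahead=1),
)

# 2. ...bulk upload points here (upload_points / upsert)...

# 3. Re-enable indexing once ingestion is done; the optimizer builds HNSW in the background
client.update_collection(
    collection_name="fast_ingest",
    hnsw_config=models.HnswConfigDiff(m=16),
    optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000),
)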
Set memmap_threshold (formerly memmap_threshold_kb) to control RAM vs disk storage. For 50M vectors, use a low threshold: qdrant_client.update_collection(collection_name='large_collection', optimizer_config=models.OptimizersConfigDiff(memmap_threshold=20000)). Value is in KB: 1KB ≈ one 256-dimensional float32 vector. For 384-dim vectors at 50M count: roughly 77GB of raw vector data (50M × 384 dims × 4 bytes). Set memmap_threshold=20000 (20MB) to force segments >20MB to disk (memory-mapped). Lower values = less RAM, slightly higher query latency for cold reads. Default 200000KB keeps most data in RAM. Production recommendation for 50M+ vectors: 10000-30000KB depending on available RAM. Combine with on_disk: true during collection creation for maximum memory savings.
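A sketch combining the two levers above - on_disk vectors at creation time plus a lowered memmap_threshold for an existing collection (names are placeholders; kwarg spellings follow this answer):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# New collection: keep original vectors on disk (memory-mapped) from the start
client.create_collection(
    collection_name="large_collection",
    vectors_config=models.VectorParams(
        size=384,
        distance=models.Distance.COSINE,
        on_disk=True,
    ),
    optimizers_config=models.OptimizersConfigDiff(memmap_threshold=20000),  # in KB
)

# Existing collection: lower the threshold so segments above ~20MB are memory-mapped
client.update_collection(
    collection_name="large_collection",
    optimizer_config=models.OptimizersConfigDiff(memmap_threshold=20000),
)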
Use the hnsw_ef parameter in search_params during query: client.search(collection_name='collection', query_vector=[...], limit=10, search_params=models.SearchParams(hnsw_ef=128)). The ef_construct (set at index creation) controls graph quality, while hnsw_ef (query-time) controls search breadth. Higher hnsw_ef = better recall but slower queries. Typical values: hnsw_ef=64-512. Default hnsw_ef equals ef_construct if not specified. For production: set lower hnsw_ef (64-128) for speed, higher (256-512) for maximum recall. Example: collection built with ef_construct=200 can query with hnsw_ef=64 (fast, ~90% recall) or hnsw_ef=256 (slower, 98%+ recall). Adjust hnsw_ef per query based on speed vs accuracy requirements.
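One way to pick hnsw_ef empirically is to compare an approximate query against an exact (brute-force) one; a sketch, with a placeholder query vector and collection name:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
query = [0.0] * 1536  # placeholder embedding

# Fast approximate search with a narrow beam
fast = client.query_points(
    collection_name="collection",
    query=query,
    limit=10,
    search_params=models.SearchParams(hnsw_ef=64),
).points

# Ground truth via exact search - slow, but useful for offline recall measurement
exact = client.query_points(
    collection_name="collection",
    query=query,
    limit=10,
    search_params=models.SearchParams(exact=True),
).points

recall = len({p.id for p in fast} & {p.id for p in exact}) / max(len(exact), 1)
print(f"recall@10 with hnsw_ef=64: {recall:.2%}")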
Create payload index on filtered fields: client.create_payload_index(collection_name='collection', field_name='status', field_schema=models.PayloadSchemaType.KEYWORD). For nested fields use dot notation: 'metadata.category'. Index types: KEYWORD (strings, enums), INTEGER (numbers), GEO (coordinates), TEXT (full-text search), FLOAT (ranges), BOOL (true/false). Unindexed filters trigger full collection scans - slow for large datasets. With payload index: Qdrant's query planner uses index directly for low-cardinality filters, bypassing HNSW entirely (10x+ speedup). Example: filtering 1M points by status='active' drops from 2000ms to 150ms. Create indexes on all frequently filtered fields. Check with collection_info to verify indexes exist before production deployment.
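A sketch of indexing the filtered fields and verifying them before deployment (field names, collection name, and the query vector are placeholders):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Index the fields you filter on; nested fields use dot notation
client.create_payload_index(
    collection_name="collection",
    field_name="status",
    field_schema=models.PayloadSchemaType.KEYWORD,
)
client.create_payload_index(
    collection_name="collection",
    field_name="metadata.category",
    field_schema=models.PayloadSchemaType.KEYWORD,
)

# Filtered vector search now uses the payload index instead of scanning every point
results = client.query_points(
    collection_name="collection",
    query=[0.0] * 1536,  # placeholder embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="status", match=models.MatchValue(value="active"))]
    ),
    limit=10,
).points

# Verify the indexes exist before relying on them in production
print(client.get_collection("collection").payload_schema)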
Default JSON payload limit is 32MB (33554432 bytes) for API requests. The error occurs when a single batch exceeds this limit. Causes: (1) Too many points per batch (reduce from 1000 to 100-500), (2) Large metadata per point (documents >100k words), (3) Multiple embedding types per point. Solutions: Reduce batch size with upload_points(..., batch_size=100). Chunk large documents into smaller pieces before embedding. Remove unnecessary metadata fields. For large payloads, increase Qdrant's max_request_size_mb in config.yaml (default 32MB). Use on_disk_payload=True in the collection config for large payloads. Check RAM availability - insufficient memory also triggers similar errors. Use the gRPC interface instead of REST for better handling of large payloads. Production best practice: keep metadata <10KB per point, batch size 100-500 for high-dimensional vectors. Note: payload storage size has no hard limit, only request size is limited.
Enable cluster mode with proper configuration: Set QDRANT__CLUSTER__ENABLED=true environment variable. Ensure all nodes run Qdrant v1.7.4+ (critical for resilience, v1.9.0+ for automatic wal_delta recovery). Each node must have separate storage directory - shared storage causes shard transfer failures. Configure internal port 6335 for cluster communication (isolate from external access). Shard transfer methods: 'stream_records' (default), 'snapshot' (for large collections), 'wal_delta' (automatic in v1.9.0+ for dead shard recovery). Common 2025 failures: consensus errors (no transfer for shard X from N to M), dead shards preventing cluster queries, snapshot compatibility mismatches between shard configurations. Solutions: Ensure same Qdrant version across all nodes, verify separate storage per node, monitor network latency <10ms between nodes. Use replication_factor=2+ for high availability. Monitor /cluster endpoint for shard health. Avoid snapshot restore from different shard configurations (e.g., 1 shard to 3 shards fails).
Configure scalar quantization during collection creation: client.create_collection(collection_name='optimized', vectors_config=VectorParams(size=1536, distance=Distance.COSINE), quantization_config=models.ScalarQuantization(scalar=models.ScalarQuantizationConfig(type=models.ScalarType.INT8, quantile=0.99, always_ram=True))). This converts float32 (4 bytes) to int8 (1 byte) per dimension = 4x memory reduction. Key parameters: type=INT8 (8-bit quantization), quantile=0.99 (outlier handling - keeps 99% of values, clips extremes), always_ram=True (keeps quantized vectors in RAM for speed). Performance: 2x faster search + 4x less memory with minimal accuracy loss (<1% recall degradation). Use quantile=0.99 for general data, 0.95 for cleaner distributions. Recommended as 'safe default choice' for production deployments. Can be combined with on_disk storage for even greater memory savings.
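A creation sketch for the scalar-quantized collection described above (the collection name is a placeholder):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="optimized",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # float32 (4 bytes) -> int8 (1 byte) per dimension
            quantile=0.99,                # clip the most extreme 1% of values
            always_ram=True,              # keep quantized vectors in RAM for speed
        )
    ),
)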
Configure binary quantization during collection creation: client.create_collection(collection_name='binary_optimized', vectors_config=VectorParams(size=1536, distance=Distance.COSINE), quantization_config=models.BinaryQuantization(binary=models.BinaryQuantizationConfig(always_ram=True))). Converts each float32 dimension to 1 bit = 32x memory reduction (1536 float32 = 6KB → 192 bytes). Trade-off: ~5-10% recall loss vs unquantized. Use oversampling to compensate: search_params=models.SearchParams(quantization=models.QuantizationSearchParams(rescore=True, oversampling=2.0)) retrieves 2x candidates, then rescores with the original vectors. Binary quantization recommended for: >1M vectors, memory-constrained deployments, high-dimensional embeddings (1024+ dimensions). Not recommended for: <1024 dimensions, critical accuracy requirements. Qdrant v1.15+ supports 1.5-bit and 2-bit binary quantization for improved precision.
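A query-side sketch of the oversampling + rescore pattern for a binary-quantized collection (collection name and query vector are placeholders):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Retrieve 2x candidates using the 1-bit vectors, then rescore them with the float32 originals
hits = client.query_points(
    collection_name="binary_optimized",
    query=[0.0] * 1536,  # placeholder embedding
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(rescore=True, oversampling=2.0)
    ),
).points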
Use binary quantization for stored vectors with asymmetric query encoding: quantization_config=models.BinaryQuantization(binary=models.BinaryQuantizationConfig(always_ram=True, query_encoding='scalar8bits')). Available in Qdrant 1.15+. Query encoding options: 'default' (same as storage), 'binary', 'scalar8bits' (8-bit quantization), 'scalar4bits' (4-bit quantization). Asymmetric approach: stored vectors use 1 bit/dimension (32x compression), query vector uses 8 bits (int8) for better precision during comparison. Benefits: binary storage size + scalar query accuracy, ideal for disk I/O bottlenecks (millions of vectors), requires less rescoring for same quality output. Use oversampling + rescore: search_params=models.SearchParams(quantization=models.QuantizationSearchParams(rescore=True, oversampling=3.0)) for <2% recall loss. Recommended for: >10M vectors, low/medium-dimensional vectors (512-1024 dims), SSD storage, memory <32GB. Qdrant 1.15 also introduced 1.5-bit and 2-bit binary quantization for improved precision.
Create a payload index with is_tenant=True for the tenant field: client.create_payload_index(collection_name='multi_tenant', field_name='tenant_id', field_schema=models.KeywordIndexParams(type='keyword', is_tenant=True)) - note that is_tenant is part of the index params, not a separate argument. Available since v1.11.0. This optimizes the index for multi-tenant filtering patterns. Upsert with tenant_id: client.upsert(collection_name='multi_tenant', points=[models.PointStruct(id=1, vector=[...], payload={'tenant_id': 'user123', 'data': '...'})]). Query with filter: client.search(collection_name='multi_tenant', query_vector=[...], query_filter=models.Filter(must=[models.FieldCondition(key='tenant_id', match=models.MatchValue(value='user123'))])). Benefits: Qdrant optimizes storage/retrieval per tenant, faster than a regular keyword index, supports millions of tenants in a single collection. Production: combine with sharding_method='custom' and shard_key='tenant_id' for tenant isolation across shards.
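A minimal multi-tenant sketch tying the pieces together (collection name, tenant id, and vectors are placeholders):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="multi_tenant",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

# Tenant-aware keyword index: is_tenant lives inside the index params
client.create_payload_index(
    collection_name="multi_tenant",
    field_name="tenant_id",
    field_schema=models.KeywordIndexParams(type="keyword", is_tenant=True),
)

client.upsert(
    collection_name="multi_tenant",
    points=[
        models.PointStruct(
            id=1,
            vector=[0.0] * 384,  # placeholder embedding
            payload={"tenant_id": "user123", "data": "example document"},
        )
    ],
)

# Every query must carry the tenant filter so tenants never see each other's data
hits = client.query_points(
    collection_name="multi_tenant",
    query=[0.0] * 384,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value="user123"))]
    ),
    limit=10,
).points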
Use payload-based partitioning (recommended): Single collection + tenant_id field for >100 tenants, shared resource pool, cost-effective. Implementation: filter by tenant_id with is_tenant=True index. Benefits: efficient resource usage, simpler management, unlimited tenants. Use separate collections: <100 tenants needing strict isolation, different vector configs per tenant (dimensions, distance metrics), regulatory compliance requiring physical separation. Trade-offs: Collection approach = higher overhead (HNSW graph per collection), more memory, complex scaling. Payload approach = shared HNSW graph, single resource pool, 10x+ more tenants. Production decision tree: <10 tenants + strict isolation = collections, >100 tenants + shared config = payload-based, hybrid needs = shard_key partitioning (v1.7+). Benchmark: payload-based handles 10K+ tenants efficiently, collections limited to ~1K due to memory overhead.
Create a collection snapshot: client.create_snapshot(collection_name='my_collection'). Returns a snapshot name like 'my_collection-2025-11-15-12-30-00.snapshot'. Download: curl 'http://localhost:6333/collections/my_collection/snapshots/snapshot_name' --output snapshot.bin. Restore on the target cluster: curl -X POST 'http://new_cluster:6333/collections/my_collection/snapshots/upload?priority=snapshot' -F 'snapshot=@snapshot.bin' (multipart upload). Or via client: client.recover_snapshot(collection_name='my_collection', location='<snapshot URL or file:// path reachable from the target node>'). Snapshots include: vectors, payloads, indexes, collection config. Version compatibility: Qdrant v1.4.x snapshots only restore to v1.4.x (same minor version). For S3 storage: curl with a pre-signed URL (workaround for direct S3 restore). Use snapshots for: cross-cluster migration, testing, backups. Not for: disaster recovery (use full backups instead). Snapshot time: ~1-5 minutes per 1M vectors.
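A sketch of the snapshot round-trip from Python, assuming the snapshot URL on the source node is reachable from the target cluster and that recover_snapshot accepts a location URL (hostnames are placeholders):

from qdrant_client import QdrantClient

source = QdrantClient(url="http://source-host:6333")
target = QdrantClient(url="http://new-cluster:6333")

# 1. Create the snapshot on the source cluster
snapshot = source.create_snapshot(collection_name="my_collection")
snapshot_url = (
    f"http://source-host:6333/collections/my_collection/snapshots/{snapshot.name}"
)

# 2. Recover it on the target cluster; the target node downloads the file itself
target.recover_snapshot(collection_name="my_collection", location=snapshot_url)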
Snapshots: Collection-level logical copies (vectors + metadata). Use for: data migration, testing, cross-cluster transfers. Creation: API-triggered, on-demand. Limitations: same Qdrant version required (v1.4.x → v1.4.x only), single collection scope, manual process. Backups: Physical disk-level copies of entire cluster. Use for: disaster recovery, point-in-time restore, production resilience. Qdrant Cloud backups: incremental (AWS/GCP), automatic daily schedule, cluster-wide. Recommendation: Use backups for disaster recovery, snapshots for data movement. Production setup: Enable automatic backups (Qdrant Cloud) + periodic snapshots for migration readiness. Backup restore: full cluster state including configs, snapshots restore: individual collections. Testing: Always verify snapshot/backup restore works before relying on it. Retention: Qdrant Cloud keeps 30-day backup history by default.
Set shard_number during collection creation: client.create_collection(collection_name='distributed', vectors_config=VectorParams(size=384, distance=Distance.COSINE), shard_number=6, replication_factor=2). Points are distributed via consistent hashing - each shard manages a non-overlapping subset. Recommendation: shard_number = multiple of cluster nodes (6 shards for 3 nodes = 2 shards/node). Minimum: 1 shard (single-node), maximum: no hard limit (tested to 100+). Replication_factor: 2-3 for production (high availability). Shards cannot split across nodes - a shard is the atomic unit. For existing collections: recent Qdrant versions support resharding, an online operation that changes the shard count without downtime (check the distributed deployment docs for the exact API and version availability). Benefits: horizontal scaling, >30GB datasets see 3.5x speedup. Optimal: 4 nodes with 8-12 shards total for balanced performance vs communication overhead.
Set replication_factor during collection creation: client.create_collection(collection_name='ha_collection', vectors_config=VectorParams(size=768, distance=Distance.COSINE), shard_number=3, replication_factor=2). Each shard is replicated to 2 nodes (primary + 1 replica). Minimum for HA: replication_factor=2 (tolerates 1 node failure). Recommended for production: replication_factor=3 (tolerates 2 failures). Write behavior: the client waits for write_consistency_factor replicas to confirm (default 1; raise it for stronger consistency guarantees). Read behavior: queries are load-balanced across replicas (faster search under load). Update existing collection: client.update_collection(collection_name='collection', replication_factor=2) triggers replica creation. Storage overhead: replication_factor=2 doubles storage. Network: replicas sync via internal port 6335. Benefits: zero-downtime deployments, fault tolerance, 2x read throughput. Trade-offs: write latency +10-20%, storage × replication_factor. Production: 3-node cluster, replication_factor=2, shard_number=6 for optimal HA.
Configure a multi-vector collection with separate dense and sparse configs: client.create_collection(collection_name='hybrid', vectors_config={'dense': models.VectorParams(size=384, distance=Distance.COSINE)}, sparse_vectors_config={'sparse': models.SparseVectorParams()}). Sparse vectors are declared in sparse_vectors_config and need no fixed size or distance metric. Upsert with both vectors: client.upsert(collection_name='hybrid', points=[models.PointStruct(id=1, vector={'dense': dense_embedding, 'sparse': models.SparseVector(indices=[...], values=[...])}, payload={})]). Query with the Query API (v1.10+) and fusion: results = client.query_points(collection_name='hybrid', prefetch=[models.Prefetch(query=dense_vec, using='dense', limit=20), models.Prefetch(query=sparse_vec, using='sparse', limit=20)], query=models.FusionQuery(fusion=models.Fusion.RRF)). Fusion methods: RRF (Reciprocal Rank Fusion - position-based) or DBSF (Distribution-Based Score Fusion, v1.11.0+ - score normalization). Dense vector: semantic (all-MiniLM-L6-v2), sparse vector: BM25/SPLADE. Benefits: 15-25% better recall than dense-only. Production: use BM25 for sparse, tune fusion via prefetch limits, DBSF for better score distribution.
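A runnable hybrid-search sketch following the config above (embeddings, sparse indices/values, and the collection name are placeholders; real sparse vectors would come from BM25/SPLADE):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Dense vectors go in vectors_config, sparse vectors in sparse_vectors_config
client.create_collection(
    collection_name="hybrid",
    vectors_config={"dense": models.VectorParams(size=384, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

client.upsert(
    collection_name="hybrid",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": [0.0] * 384,  # placeholder dense embedding
                "sparse": models.SparseVector(indices=[10, 42, 97], values=[0.6, 1.2, 0.3]),
            },
            payload={"text": "example document"},
        )
    ],
)

# Query API (v1.10+): run both searches as prefetch stages, then fuse with RRF
results = client.query_points(
    collection_name="hybrid",
    prefetch=[
        models.Prefetch(query=[0.0] * 384, using="dense", limit=20),
        models.Prefetch(
            query=models.SparseVector(indices=[10, 42], values=[0.8, 0.5]),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
).points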
Create a text payload index: client.create_payload_index(collection_name='docs', field_name='content', field_schema=models.TextIndexParams(type='text', tokenizer='word', min_token_len=2, max_token_len=20, lowercase=True)). Available since v1.1.0. Tokenizers: 'word' (splits on spaces and punctuation), 'whitespace' (splits on whitespace only), 'prefix', 'multilingual'. Query with text match: client.search(collection_name='docs', query_vector=vec, query_filter=models.Filter(must=[models.FieldCondition(key='content', match=models.MatchText(text='machine learning'))])). The full-text index supports token (and, in newer versions, phrase) match filtering; it does not provide BM25-style relevance scoring - use sparse vectors for that. Benefits: keyword filtering on payload text fields, partial matches with the prefix tokenizer, multilingual support. Use cases: filter by document keywords before vector search, hybrid retrieval (vector + keyword). Limitations: not a replacement for full-text search engines (Elasticsearch), basic tokenization only. Production: combine with vector search for a hybrid approach, index all searchable text fields.
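A sketch of the text index plus a text-filtered vector query (collection name, field, and query vector are placeholders):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Full-text index on the 'content' payload field
client.create_payload_index(
    collection_name="docs",
    field_name="content",
    field_schema=models.TextIndexParams(
        type="text",
        tokenizer=models.TokenizerType.WORD,
        min_token_len=2,
        max_token_len=20,
        lowercase=True,
    ),
)

# Vector search restricted to points whose 'content' contains all query tokens
hits = client.query_points(
    collection_name="docs",
    query=[0.0] * 384,  # placeholder embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="content", match=models.MatchText(text="machine learning"))]
    ),
    limit=10,
).points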
Monitor via /metrics endpoint (Prometheus/OpenMetrics format at port 6333): key metrics: (1) app_info (version, startup time), (2) app_status_recovery_mode (0=healthy, 1=recovering), (3) collections_total (collection count), (4) collections_vector_total (total vectors), (5) cluster_enabled (cluster mode status), (6) rest_responses_total (API requests by status code), (7) rest_responses_duration_seconds (p50, p95, p99 latency). Storage metrics: storage_total_size_bytes, storage_mmap_size_bytes, storage_ram_size_bytes. Collection-level: qdrant_collection_segments (segment count), qdrant_collection_vectors (vectors per collection). Setup: Prometheus scrape http://qdrant:6333/metrics every 15s. For Qdrant Cloud: use /sys_metrics endpoint for additional infrastructure data (load balancers, ingresses, cluster workloads). Alert on: app_status_recovery_mode=1, rest_responses_total{code='5xx'} spike, rest_responses_duration_seconds p99 >1s. Grafana dashboard: import Qdrant official template from GitHub. Production: enable telemetry for detailed traces, set log level=INFO.
Set API key via environment variable: QDRANT__SERVICE__API_KEY=your_secret_key_here. Start Qdrant with config: docker run -e QDRANT__SERVICE__API_KEY=my_secret_key -p 6333:6333 qdrant/qdrant. Client usage: client = QdrantClient(url='https://qdrant.example.com', api_key='my_secret_key'). API key sent in header: api-key: my_secret_key. Production best practices: (1) Use strong random keys (32+ characters), (2) Rotate keys periodically, (3) Store in secrets manager (AWS Secrets Manager, HashiCorp Vault), (4) Enable TLS/HTTPS (required for API key security), (5) Network isolation (firewall port 6333 to trusted IPs only). Qdrant Cloud: read-only vs read-write API keys supported. For internal deployments: combine API key + mTLS for defense-in-depth. Monitor failed auth attempts via rest_responses_total{code='401'}. Disable API key for local dev only.
Enable on-disk storage during collection creation: client.create_collection(collection_name='cost_optimized', vectors_config=VectorParams(size=1536, distance=Distance.COSINE, on_disk=True), optimizers_config=models.OptimizersConfigDiff(memmap_threshold=10000)). Vectors stored on SSD, memory-mapped for access. Benefits: 10x cost reduction (RAM vs SSD), scales to billions of vectors. Trade-offs: 2-5x slower queries (SSD latency vs RAM). Combine with quantization for maximum savings: on_disk=True + scalar quantization = 40x memory reduction. memmap_threshold=10000 (10MB) forces segments >10MB to disk. Production setup: NVMe SSD for <50ms p95 latency, keep payload indexes in RAM for filtering performance. Use cases: large archives, cost-sensitive deployments, >100M vectors. Not recommended for: latency-critical apps (<10ms SLA), high QPS (>1000/s). Qdrant Cloud: on_disk reduces costs 5-10x vs in-memory tier. Monitor: storage_mmap_size_bytes vs storage_ram_size_bytes.
Migration strategy: (1) Setup: Deploy a Qdrant cluster (self-hosted or Cloud) and create a collection matching the Pinecone config: dimension, distance metric (cosine/euclidean/dot). (2) Use the official tool: Qdrant provides a Docker-based migration tool (github.com/qdrant/migration) supporting Pinecone, Chroma, Weaviate with resumable transfers. Alternative: manual migration. (3) Dual-write phase: write new vectors to both Pinecone and Qdrant for 1-2 weeks. (4) Backfill: export Pinecone data via list() + fetch() in batches and upsert to Qdrant with upload_points(batch_size=1000, parallel=4); see the sketch below. (5) Validation: compare query results (same vector, both systems, diff <1%). (6) Cutover: route 100% of reads to Qdrant, stop Pinecone writes. (7) Cleanup: delete the Pinecone index after a 7-day safety period. Key differences: Pinecone namespaces map to Qdrant payload filters or shard keys (there is no direct namespace concept), and Qdrant point IDs must be unsigned integers or UUIDs while Pinecone allows arbitrary strings (remap non-UUID IDs during backfill). One 2025 migration reports total request time improving from 2.5s to about 1s after the switch.
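A rough backfill sketch under several explicit assumptions: you already have the list of Pinecone IDs to migrate (gathered from your own system - listing APIs differ between Pinecone index types), you use the Pinecone v3+ Python client where Index.fetch(ids=...) returns an object with a .vectors mapping of id to vector (values + metadata), and the Qdrant collection 'migrated' already exists with a matching dimension and metric. Non-UUID Pinecone IDs are remapped to deterministic UUIDs:

import uuid

from pinecone import Pinecone  # assumes the Pinecone v3+ client
from qdrant_client import QdrantClient, models

pc_index = Pinecone(api_key="PINECONE_KEY").Index("source-index")
qdrant = QdrantClient(url="http://localhost:6333")

all_ids = [...]  # placeholder: the Pinecone IDs you intend to migrate

def fetched_points(ids, chunk=500):
    for start in range(0, len(ids), chunk):
        resp = pc_index.fetch(ids=ids[start:start + chunk])
        for pc_id, vec in resp.vectors.items():
            yield models.PointStruct(
                # Qdrant IDs must be unsigned ints or UUIDs: derive a stable UUID
                # from the Pinecone string ID and keep the original in the payload
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, pc_id)),
                vector=list(vec.values),
                payload={**(vec.metadata or {}), "pinecone_id": pc_id},
            )

qdrant.upload_points(
    collection_name="migrated",
    points=fetched_points(all_ids),
    batch_size=1000,
    parallel=4,
)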