Vector Db Optimization FAQ & Answers
8 expert Vector DB Optimization answers researched from official documentation. Every answer cites authoritative sources you can verify.
Cosine similarity thresholds for duplicate detection depend on the embedding model and domain, but general guidelines exist. Cosine similarity ranges from -1 (opposite) to 1 (identical), with 0 meaning orthogonal.

Threshold ranges for semantic duplicate detection:
0.95-1.0: Near-identical or exact duplicates (same content, minor phrasing differences). High precision, low recall. Use for: de-duplication, plagiarism detection.
0.85-0.95: Strong semantic similarity (same meaning, different wording). Balanced precision/recall. Use for: question matching, FAQ retrieval, content recommendations.
0.70-0.85: Moderate similarity (related topics, overlapping concepts). Higher recall, lower precision. Use for: related content discovery, broad topic matching.
Below 0.70: Weak similarity (tangentially related or unrelated). High recall, very low precision.

Model-specific thresholds: sentence-transformers/all-MiniLM-L6-v2: duplicates typically > 0.85, related content 0.70-0.85. OpenAI text-embedding-3-small: duplicates > 0.90, related 0.75-0.90. Cohere embed-english-v3.0: duplicates > 0.88, related 0.72-0.88.

Calibration approach: (1) Create a labeled test set with known duplicates and non-duplicates. (2) Compute similarities for the pairs. (3) Plot a precision-recall curve across thresholds. (4) Select the threshold that balances precision and recall for your use case; the F1-optimized threshold often falls between 0.80 and 0.90. A minimal calibration sketch follows this answer.

Production pattern: two-stage filtering. A recall stage (threshold 0.70) retrieves candidates, then a precision stage (threshold 0.90) re-ranks them into the final set of duplicates.

Edge cases: Short texts (< 10 words) need higher thresholds (0.90+) due to noise. Domain matters: legal and medical documents may need 0.92+ for duplicates, while FAQ and support tickets are often fine at 0.80-0.85.

Qdrant implementation: client.search(collection_name='my_collection', query_vector=embedding, limit=10, score_threshold=0.85).

Best practice: Start with 0.85 for duplicates, tune against precision/recall metrics on a validation set, and use the cosine distance metric rather than Euclidean for normalized embeddings. Essential for reliable semantic search.
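The calibration step above can be scripted directly. Below is a minimal sketch, assuming a small hand-labeled list of text pairs and the all-MiniLM-L6-v2 model; the example pairs and the 0.70-0.99 sweep range are illustrative assumptions, not part of the guidance above.

```python
# Threshold calibration sketch: sweep thresholds over labeled pairs and pick the F1-optimal one.
from sentence_transformers import SentenceTransformer, util

labeled_pairs = [  # (text_a, text_b, is_duplicate) -- hypothetical examples
    ("How do I reset my password?", "How can I change my password?", True),
    ("How do I reset my password?", "What are your support hours?", False),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb_a = model.encode([a for a, _, _ in labeled_pairs], normalize_embeddings=True)
emb_b = model.encode([b for _, b, _ in labeled_pairs], normalize_embeddings=True)
sims = [float(util.cos_sim(x, y)) for x, y in zip(emb_a, emb_b)]
labels = [dup for _, _, dup in labeled_pairs]

# Sweep candidate thresholds and keep the one with the best F1.
best = (0.0, 0.0)  # (f1, threshold)
for t in [x / 100 for x in range(70, 100)]:
    preds = [s >= t for s in sims]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    best = max(best, (f1, t))

print(f"F1-optimal threshold: {best[1]:.2f} (F1={best[0]:.2f})")
```

In practice the labeled set should contain at least a few hundred pairs drawn from your own corpus; with only a handful of pairs the selected threshold will not generalize.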
Optimal Qdrant upsert batch size is 50-500 vectors, depending on vector dimensionality, payload size, and available memory. Larger batches amortize network and transaction overhead but risk timeouts and memory exhaustion.

Performance characteristics:
Batch size 1-10: High overhead from individual HTTP requests/transactions. Throughput: ~100-500 vectors/sec. Use for: real-time single-document indexing.
Batch size 50-100: Balanced overhead and memory. Throughput: ~2,000-5,000 vectors/sec for 384-dim embeddings. Recommended default. Use for: streaming ingestion, incremental updates.
Batch size 100-500: Near-optimal throughput. Throughput: ~5,000-10,000 vectors/sec for 384-dim, ~2,000-5,000 for 1536-dim. Use for: bulk imports, batch processing.
Batch size 500+: Diminishing returns, increased memory pressure, and timeout risk for large payloads; throughput plateaus. Only use with small dimensions (< 384) and minimal payload.

Formula for batch size selection: batch_size = min(500, max(50, 10_000_000 / (dimensions * 4 + payload_size_bytes))). Examples: 384-dim (1,536 bytes) + 1KB payload → batch_size = min(500, 10MB / 2.5KB) = 500. 1536-dim (6,144 bytes) + 5KB payload → batch_size = min(500, 10MB / 11KB) = min(500, 909) = 500.

Implementation pattern (accumulate and flush): append PointStruct(id=uuid4(), vector=vector, payload=payload) objects to a batch list; when len(batch) >= batch_size, call client.upsert(collection_name='my_collection', points=batch) and reset the list; remember to flush the final partial batch at the end. A runnable sketch follows this answer.

Qdrant-specific optimizations: upsert with wait=False for asynchronous indexing (trades immediate consistency for higher throughput); use the gRPC client instead of HTTP (10-30% faster for batch operations); use wait=True when immediate consistency matters, at the cost of throughput.

Memory considerations: Client-side batching consumes roughly batch_size * (dimensions * 4 + payload_size) bytes; 500 vectors × 384 dims × 4 bytes/float ≈ 768KB per batch, which is manageable. Server-side, Qdrant buffers points in memory before writing to disk, so very large batches can trigger backpressure.

Benchmarking: measure vectors/sec at different batch sizes with a benchmarking tool or a small custom script run against your own data.

Best practice: Start with batch_size=100 for 384-dim embeddings, increase toward 500 if memory allows, monitor Qdrant memory/CPU via its Prometheus metrics, and use asynchronous upserts (wait=False) for bulk imports. Essential for efficient vector ingestion.
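As referenced above, here is a minimal accumulate-and-flush upsert sketch using the Python qdrant-client. The server URL, the collection name 'my_collection', and the shape of the data iterable are assumptions for illustration, and the collection is assumed to already exist with a matching vector size and distance metric.

```python
# Batched upsert sketch: buffer points client-side and flush in fixed-size batches.
from uuid import uuid4
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
BATCH_SIZE = 100  # reasonable starting point for 384-dim embeddings

def upsert_in_batches(data, collection_name="my_collection", batch_size=BATCH_SIZE):
    """data: iterable of (vector, payload) pairs."""
    batch = []
    for vector, payload in data:
        batch.append(PointStruct(id=str(uuid4()), vector=vector, payload=payload))
        if len(batch) >= batch_size:
            # wait=False returns before indexing completes -> higher bulk throughput
            client.upsert(collection_name=collection_name, points=batch, wait=False)
            batch = []
    if batch:
        # Flush the final partial batch; forgetting this silently drops the tail of the dataset.
        client.upsert(collection_name=collection_name, points=batch, wait=True)
```

The closing wait=True call makes the last batch visible before the function returns; for pure bulk loads you can keep wait=False everywhere and verify counts afterwards.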
Sentence-transformers models trade off dimensionality (vector size), inference speed (encoding time), and quality (retrieval accuracy). Common models from Hugging Face:

all-MiniLM-L6-v2 (most popular): 384 dimensions; ~14,000 sentences/sec on CPU (multi-threaded), ~30,000-50,000/sec on GPU (T4). Size: 80MB. Quality: 58.8 average on the STS benchmark. Use for: high-throughput applications, resource-constrained environments, low-latency search.
all-mpnet-base-v2: 768 dimensions; ~1,500 sentences/sec on CPU, ~15,000/sec on GPU. Size: 420MB. Quality: 63.3 STS average (best in class for its size). Use for: balanced quality/performance, general-purpose semantic search.
all-distilroberta-v1: 768 dimensions; ~1,800 sentences/sec on CPU, ~18,000/sec on GPU. Size: 290MB. Quality: 61.8 average. Use for: slightly faster than mpnet with similar quality.
gte-large: 1024 dimensions; ~800 sentences/sec on CPU, ~8,000/sec on GPU. Size: 670MB. Quality: state-of-the-art for complex queries. Use for: high-accuracy retrieval, research applications.
OpenAI text-embedding-3-small: 1536 dimensions (API only); ~1,000 sentences/sec via API (rate limited). Quality: superior to open-source. Cost: $0.02 per 1M tokens. Use for: production with budget, highest-quality needs.

Tradeoffs summary: higher dimensions → higher storage cost (in Qdrant, 1536-dim vectors use 4× the storage of 384-dim), slower search (higher-dimensional cosine similarity is slower), higher quality (more information capacity).

Inference speed: smaller models (MiniLM) are 2-4× faster than larger ones (mpnet); GPU acceleration gives a 10-20× speedup; batch encoding amortizes overhead (encoding 32 sentences at once is 5-10× faster than sequential encoding).

Memory: 384-dim ≈ 1.5KB/vector, 768-dim ≈ 3KB/vector, 1536-dim ≈ 6KB/vector. For 1M vectors: 384-dim = 1.5GB, 1536-dim = 6GB.

Real-world benchmarks: MiniLM-L6: ~50ms for 100 sentences on CPU (single-threaded); mpnet-base: ~120ms for 100 sentences on CPU; gte-large: ~250ms for 100 sentences on CPU.

Selection criteria: high throughput and low latency → all-MiniLM-L6-v2 (384-dim); balanced → all-mpnet-base-v2 (768-dim); maximum quality → gte-large (1024-dim) or the OpenAI API (1536-dim); resource constrained → all-MiniLM-L6-v2.

Best practice: profile with your own data, for example: from sentence_transformers import SentenceTransformer; import time; model = SentenceTransformer('all-MiniLM-L6-v2'); sentences = ['test'] * 100; start = time.time(); embeddings = model.encode(sentences); print(f'{len(sentences) / (time.time() - start):.0f} sentences/sec'). A slightly fuller comparison sketch follows this answer. Essential for embedding model selection.
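To compare candidate models on your own hardware, a throughput-and-footprint sketch along the following lines can help. The model list, sentence count, and batch size are arbitrary choices, and the printed numbers will differ from the figures quoted above.

```python
# Model comparison sketch: measure encoding throughput and per-vector memory for each candidate.
import time
from sentence_transformers import SentenceTransformer

candidates = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]  # extend with others as needed
sentences = ["The quick brown fox jumps over the lazy dog."] * 256

for name in candidates:
    model = SentenceTransformer(name)
    model.encode(sentences[:8])  # warm-up so model loading doesn't skew timing
    start = time.time()
    embeddings = model.encode(sentences, batch_size=32, show_progress_bar=False)
    elapsed = time.time() - start
    dim = embeddings.shape[1]
    print(f"{name}: {len(sentences) / elapsed:.0f} sentences/sec, "
          f"{dim} dims, ~{dim * 4 / 1024:.1f} KB/vector (float32)")
```

Running this against a sample of the documents you actually plan to index gives more representative numbers than synthetic repeated sentences.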
The Qdrant distance metric must match the embedding model's optimization objective for accurate similarity ranking. Three metrics are covered here: Cosine, Euclidean (L2), and Dot Product.

Cosine similarity: measures the angle between vectors, range [-1, 1]. Formula: cos(θ) = (A · B) / (||A|| × ||B||). Invariant to vector magnitude (normalized vectors). Use for: sentence-transformers models (all-MiniLM, all-mpnet, etc.), which output L2-normalized embeddings with ||vector|| = 1; OpenAI text-embedding-* models (normalized by default); any model trained with a cosine-similarity loss. Qdrant config: distance=Distance.COSINE. Best for: most semantic search use cases (the default choice for text).

Euclidean distance (L2): measures straight-line distance between vectors. Formula: L2 = sqrt(Σ(A_i - B_i)²). Sensitive to vector magnitude. Use for: image embeddings from CNNs (ResNet, EfficientNet), which are not L2-normalized; Word2Vec/GloVe (magnitude encodes frequency); models trained with a Euclidean-distance loss. Qdrant config: distance=Distance.EUCLID. Note: for normalized vectors, Euclidean and Cosine produce equivalent rankings (on different scales).

Dot product: raw inner product, range (-∞, +∞). Formula: A · B = Σ(A_i × B_i). Combines angle and magnitude. Use for: CLIP image-text embeddings (trained with dot products of normalized features); models with learned magnitude (e.g., BM25-style hybrid scoring); recommendation systems where magnitude encodes importance. Qdrant config: distance=Distance.DOT. Caution: requires careful normalization.

Verification: check the model documentation for its training objective. Example: sentence-transformers/all-MiniLM-L6-v2 uses CosineSimilarityLoss → use Cosine; openai/clip-vit-base-patch32 uses a contrastive loss over dot products → use Dot.

Conversion: for normalized vectors, Cosine, Euclidean, and Dot are equivalent for ranking (Cosine = 1 - Euclidean²/2 = Dot when ||A|| = ||B|| = 1). A small numeric check of this identity follows this answer.

Qdrant collection creation: from qdrant_client.models import Distance, VectorParams; client.create_collection(collection_name='docs', vectors_config=VectorParams(size=384, distance=Distance.COSINE)).

Common mistake: using Euclidean for sentence-transformers embeddings (it works, but Cosine is more principled). Performance: Cosine and Dot have similar speed (~5% difference); Euclidean is slightly slower due to the sqrt.

Best practice: (1) Default to Cosine for text embeddings (sentence-transformers, OpenAI). (2) Use Euclidean for image embeddings from CNNs. (3) Use Dot for CLIP and hybrid scoring. (4) Test on a validation set if unsure. Essential for optimal vector search accuracy.
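The ranking-equivalence claim for normalized vectors can be verified numerically. The following sketch uses random 384-dimensional vectors purely for illustration.

```python
# Numeric check: for L2-normalized vectors, cosine similarity, dot product, and
# (negated) Euclidean distance rank neighbors identically.
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=384)
docs = rng.normal(size=(5, 384))

# L2-normalize everything, as sentence-transformers models do by default.
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

cosine = docs @ query                       # equals the dot product for unit vectors
euclid = np.linalg.norm(docs - query, axis=1)

print(np.argsort(-cosine))  # ranking by cosine/dot (descending similarity)
print(np.argsort(euclid))   # ranking by Euclidean (ascending distance) -- same order
assert np.allclose(cosine, 1 - euclid**2 / 2)  # the identity from the Conversion note
```

The same check fails for unnormalized vectors, which is exactly why the metric choice matters for models that do not normalize their outputs.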