Pinecone FAQ & Answers
50 expert Pinecone answers researched from official documentation. Every answer cites authoritative sources you can verify.
Use the fetch() method for direct ID-based retrieval without similarity computation: index.fetch(ids=['vec1', 'vec2', 'vec3']). This returns vectors and metadata by ID only, avoiding the computational overhead of a similarity search. To retrieve the IDs in a namespace first, list them (Python: index.list_paginated(prefix='doc1#'); Node.js: results = await index.listPaginated({ prefix: 'doc1#' })), then fetch those IDs. Fetch is ideal when you know the exact vector IDs and only need their metadata/values. Example: result = index.fetch(ids=['0', '1'], namespace='documents') returns a dictionary of vectors and metadata. Fetch supports multiple IDs in a single call (recommended batch size: 100-1000 IDs). No similarity ranking is performed - it is a pure key-value lookup. Use fetch for retrieval by known ID, use query() for semantic search.
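A minimal Python sketch of ID-based retrieval; the API key, index name, namespace, and IDs are placeholders, and the response shape follows the Python SDK's FetchResponse (adjust if your SDK version differs):

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")          # use an environment variable in practice
    index = pc.Index("documents-index")            # hypothetical index name

    # Pure key-value lookup: no similarity search is performed.
    response = index.fetch(ids=["doc1#chunk0", "doc1#chunk1"], namespace="documents")

    for vec_id, record in response.vectors.items():
        print(vec_id, record.metadata)             # record.values holds the embedding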
Use the 2025-04 API delete method: index.delete(delete_all=True, namespace='example-namespace'). This deletes the entire namespace and all of its records irreversibly. For selective deletion of many records in serverless indexes: reads and writes don't share compute resources, so large batch deletes are safe. Python: from pinecone.grpc import PineconeGRPC as Pinecone; pc = Pinecone(api_key='KEY'); index = pc.Index(host='HOST'); index.delete(delete_all=True, namespace='ns'). cURL: curl -X DELETE "https://$INDEX_HOST/namespaces/$NAMESPACE" -H "Api-Key: $KEY" -H "X-Pinecone-API-Version: 2025-04". For pod-based indexes with many vectors: batch deletes slowly to avoid affecting query latency. Serverless indexes handle large deletions better due to isolated compute. Namespace deletion is permanent - verify the namespace before executing.
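A short Python sketch of a namespace wipe with a guardrail; the key, index name, and namespace are placeholders, and the stats object shape follows the Python SDK's describe_index_stats response:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("example-index")                    # hypothetical index name

    TARGET_NAMESPACE = "example-namespace"

    # Guardrail: confirm the namespace exists and see how many records will be removed.
    stats = index.describe_index_stats()
    ns_stats = stats.namespaces.get(TARGET_NAMESPACE)
    if ns_stats is None:
        raise SystemExit(f"Namespace {TARGET_NAMESPACE!r} not found; nothing to delete.")
    print(f"Deleting {ns_stats.vector_count} records from {TARGET_NAMESPACE!r}...")

    # Irreversible: removes every record in the namespace.
    index.delete(delete_all=True, namespace=TARGET_NAMESPACE)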
Use metadata filters with $eq and $gte/$lte operators: index.query(vector=query_vector, filter={'user_id': {'$eq': 'user123'}, 'date': {'$gte': 20250101, '$lte': 20251231}}, top_k=10, include_metadata=True). Store dates as integers in YYYYMMDD format - string dates don't work with $gte/$lte (expects numbers). Supported operators: $eq (equals), $ne (not equals), $gt/$gte (greater than/equals), $lt/$lte (less than/equals), $in (in list). Combine multiple filters in same query - Pinecone applies all conditions. Full example: from pinecone import Pinecone; pc = Pinecone(api_key='KEY'); index = pc.Index('index-name'); results = index.query(...). Metadata filtering has same performance as namespace filtering. Include include_metadata=True to return metadata in results.
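A sketch of the date-range query in Python, assuming records were upserted with an integer date field in YYYYMMDD form; the index name and query vector are placeholders:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("index-name")                       # hypothetical index name

    query_vector = [0.0] * 1536                          # placeholder; use a real embedding

    results = index.query(
        vector=query_vector,
        filter={
            "user_id": {"$eq": "user123"},
            "date": {"$gte": 20250101, "$lte": 20251231},  # integers, not strings
        },
        top_k=10,
        include_metadata=True,
    )
    for match in results.matches:
        print(match.id, match.score, match.metadata.get("date"))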
Use Pinecone sparse-dense vectors with BM25Encoder: from pinecone_text.sparse import BM25Encoder; bm25 = BM25Encoder().fit(corpus); sparse_vec = bm25.encode_queries('query text'); dense_vec = model.encode('query text'); index.query(vector=dense_vec, sparse_vector={'indices': sparse_vec['indices'], 'values': sparse_vec['values']}, top_k=10, namespace='hybrid'). Index must use dotproduct metric (only metric supporting sparse vectors). Upserts require sparse_values parameter for each vector. BM25Encoder: fit tf-idf values to your corpus (default values not recommended). Use multi-qa-MiniLM-L6-cos-v1 or similar for dense vectors. Hybrid search combines keyword relevance (BM25) with semantic understanding (embeddings). LangChain PineconeHybridSearchRetriever automates this pattern. Create index with metric='dotproduct' for sparse support.
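A sketch of the hybrid pattern end to end, assuming pinecone-text and sentence-transformers are installed and the index was created with metric='dotproduct'; the corpus, index name, and namespace are illustrative:

    from pinecone import Pinecone
    from pinecone_text.sparse import BM25Encoder
    from sentence_transformers import SentenceTransformer

    corpus = ["first document text", "second document text"]   # fit BM25 on your own corpus

    bm25 = BM25Encoder().fit(corpus)
    dense_model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

    query = "query text"
    sparse_vec = bm25.encode_queries(query)           # {'indices': [...], 'values': [...]}
    dense_vec = dense_model.encode(query).tolist()

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("hybrid-index")                  # must use the dotproduct metric

    results = index.query(
        vector=dense_vec,
        sparse_vector={"indices": sparse_vec["indices"], "values": sparse_vec["values"]},
        top_k=10,
        namespace="hybrid",
        include_metadata=True,
    )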
IMPORTANT: Customers signing up for Standard/Enterprise on or after August 18, 2025 CANNOT create pod-based indexes - use serverless instead. For existing pods: p2 pods support 200 QPS per replica for vectors <128 dimensions with topK<50, returning queries in <10ms. For 100 concurrent requests: configure 2-3 replicas (200 QPS × 2 = 400 QPS capacity). Create with pod_type='p2.x1' or 'p2.x2'. A single p2.x8 pod supports >1000 QPS for 10M vectors (256-dim). Keep topK<50 for optimal performance. Increase replicas with pc.configure_index('index-name', replicas=3) (configure_index is a method on the Pinecone client, not the Index object). For 100K documents: p2.x1 with 2 replicas handles 100 concurrent queries at <10ms latency. Performance varies by: dimensionality, topK, filters, cloud provider. Scale replicas for throughput, scale pod size for larger datasets. Use the gRPC client for best performance. Migration: pods are deprecated; migrate to serverless for auto-scaling and 50x lower cost.
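For accounts that can still use pods, a sketch of creating a p2 index with extra replicas and scaling replicas later; the environment, index name, and dimension are illustrative assumptions:

    from pinecone import Pinecone, PodSpec

    pc = Pinecone(api_key="YOUR_API_KEY")

    # Two p2.x1 replicas to target roughly 400 QPS for low-dimensional vectors.
    pc.create_index(
        name="qa-index",                                   # hypothetical index name
        dimension=128,
        metric="cosine",
        spec=PodSpec(environment="us-east-1-aws", pod_type="p2.x1", replicas=2),
    )

    # Scale read throughput later by adding replicas.
    pc.configure_index("qa-index", replicas=3)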
Ensure vector dimensions match index exactly: if index created with dimension=1536, all upserted vectors must have exactly 1536 values. Error 'Vector dimension X does not match the dimension of the index Y' indicates mismatch. Correct format: index.upsert(vectors=[{'id': 'vec1', 'values': [0.1, 0.2, ...], 'metadata': {'tags': ['a', 'b'], 'score': 0.95, 'text': 'content'}}], namespace='ns'). Metadata supports arrays, primitives, nested objects - structure doesn't affect dimension error. Common causes: (1) Wrong embedding model (ada-002=1536, text-embedding-3-large=3072), (2) Truncated/padded vectors, (3) Empty vectors (dimension 0). Verify: len(vector_values) == index_dimension before upsert. For 2025-01 API with integrated embeddings: index converts text to vectors automatically. Check embedding model output matches index dimension. Use try-except to catch dimension errors early.
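A small sketch of a pre-upsert dimension check, assuming a 1536-dimension index (for example ada-002 embeddings); the index name, helper, and sample vector are illustrative:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("docs-index")                         # hypothetical index name

    index_dimension = pc.describe_index("docs-index").dimension

    def safe_upsert(vectors, namespace="ns"):
        # Fail fast on dimension mismatch instead of waiting for the API error.
        for v in vectors:
            if len(v["values"]) != index_dimension:
                raise ValueError(
                    f"Vector {v['id']} has {len(v['values'])} dims, index expects {index_dimension}"
                )
        index.upsert(vectors=vectors, namespace=namespace)

    safe_upsert([{"id": "vec1", "values": [0.1] * 1536, "metadata": {"text": "content"}}])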
Error occurs when batch exceeds 2MB limit: 'Request size 3MB exceeds the maximum supported size of 2MB'. Pinecone caps payload at 2MB per request. Each vector stores up to 40KB metadata. Solution: reduce batch size based on total bytes (vectors + metadata), not just count. Recommended: upsert batches up to 1000 records without exceeding 2MB. Calculate batch size: num_vectors * (dimensions * 4 bytes + metadata_size) < 2MB. For 1536-dim vectors: ~300 vectors with minimal metadata, fewer with large metadata. Use gRPC client for better performance: from pinecone.grpc import PineconeGRPC. Implement dynamic batching: check byte size before sending. Example: batch 100 vectors at a time for high-dimensional embeddings with metadata. Monitor total payload size, not just vector count. Serverless indexes handle batching more efficiently than pods.
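A sketch of size-aware batching: it estimates 4 bytes per float plus serialized metadata and flushes a batch before the 2MB ceiling from the answer above; the helper names and limits are illustrative:

    import json

    MAX_REQUEST_BYTES = 2 * 1024 * 1024   # Pinecone's per-request cap (~2MB)

    def estimate_size(vector):
        # Rough estimate: 4 bytes per float plus the serialized metadata and the ID.
        values_bytes = len(vector["values"]) * 4
        metadata_bytes = len(json.dumps(vector.get("metadata", {})))
        return values_bytes + metadata_bytes + len(vector["id"])

    def batch_by_size(vectors, max_bytes=MAX_REQUEST_BYTES, max_count=1000):
        batch, batch_bytes = [], 0
        for vec in vectors:
            size = estimate_size(vec)
            if batch and (batch_bytes + size > max_bytes or len(batch) >= max_count):
                yield batch
                batch, batch_bytes = [], 0
            batch.append(vec)
            batch_bytes += size
        if batch:
            yield batch

    # Usage: for chunk in batch_by_size(all_vectors): index.upsert(vectors=chunk)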
Combine namespace parameter with filter in query: index.query(vector=query_vector, namespace='documents', filter={'status': {'$eq': 'processed'}}, top_k=10, include_metadata=True). For multiple namespaces use query_namespaces() utility: from pinecone import Pinecone; pc = Pinecone(api_key='KEY'); index = pc.Index('index'); combined = index.query_namespaces(vector=query_vec, namespaces=['ns1', 'ns2', 'ns3'], top_k=10, filter={'genre': {'$eq': 'comedy'}}). This runs query in parallel across namespaces and merges results into single ranked set. Single query limited to one namespace - use query_namespaces for multiple. Filter supports operators: $eq, $ne, $gt, $gte, $lt, $lte, $in. Performance identical for namespaces vs metadata filtering. Best practice: use namespaces for user/tenant isolation, metadata for attribute filtering within namespace.
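A sketch of a multi-namespace query with the Python SDK's query_namespaces utility; note the exact signature varies by SDK version (newer releases also expect the index's metric), so treat the arguments as assumptions to check against your installed client:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("index")                            # hypothetical index name

    query_vec = [0.0] * 1536                             # placeholder embedding

    combined = index.query_namespaces(
        vector=query_vec,
        namespaces=["ns1", "ns2", "ns3"],
        metric="cosine",          # required by some SDK versions; match your index metric
        top_k=10,
        filter={"genre": {"$eq": "comedy"}},
        include_metadata=True,
    )
    for match in combined.matches:
        print(match.id, match.score)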
Common causes: (1) Missing/incorrect API key in Vercel environment variables - verify API key is correct (check x-pinecone-auth-rejected-reason header for 'Wrong API key'), (2) Outdated Pinecone SDK requiring legacy environment parameter - update to latest client, (3) API key not accessible in serverless context. Fix: Set environment variables in Vercel dashboard (Project Settings > Environment Variables): PINECONE_API_KEY, PINECONE_CLOUD, PINECONE_REGION, PINECONE_INDEX. Redeploy after adding variables. Modern initialization (no environment parameter): const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY }). For serverless indexes: must use updated Pinecone client (older clients raise connection errors). Test locally with same environment variables before deploying. Check response headers for x-pinecone-auth-rejected-reason to diagnose auth issues. Use Pinecone's official Vercel starter template as reference for correct setup.
Use serverless for 95% of use cases: auto-scaling, 50x lower cost, 47% lower latency vs pods, no resource management. Serverless recommended for: variable workloads, cost-sensitive deployments, new projects, <10M vectors. Pod-based (legacy, being phased out): customers who sign up for Standard/Enterprise after August 18, 2025 CANNOT create pod-based indexes. Use pods only for: existing deployments migrating to serverless, specialized performance requirements with dedicated read nodes. Serverless features: live updates, metadata filtering, hybrid search, namespaces (all pod features). Cost: serverless usage-based (pay for reads/writes), pods fixed (pay for peak capacity even when idle). Migration: Pinecone recommends serverless for all new workloads. Create serverless: pc.create_index(name='index', dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-east-1')). Production: serverless is default choice in 2025.
Initialize Pinecone client with pool_threads for parallel requests: from pinecone import Pinecone; pc = Pinecone(api_key='key', pool_threads=8); index = pc.Index('index-name'). pool_threads controls concurrent API requests - more threads = higher throughput. Recommended: pool_threads=8-16 for production, 4+ minimum. Batch size: up to 1000 vectors per batch (max 2MB payload). For LangChain + OpenAI embeddings: use pool_threads>4, embedding_chunk_size>=1000, batch_size=64 for 5x speedup. Parallel batching: send multiple batches concurrently - pool_threads enables this. Example: upsert 100K vectors in batches of 1000 with 10 parallel threads = 10x faster than sequential. Trade-off: too many threads (>20) risks rate limits. Monitor: API rate limits (varies by plan), memory usage. Production: combine with async/await for maximum throughput. Serverless indexes handle parallel writes better than pods.
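A sketch of parallel batched upserts using the pool_threads/async_req pattern, here passing pool_threads when opening the index; the batch size, thread count, data, and index name are illustrative:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")

    def chunked(seq, size=200):
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    vectors = [{"id": f"vec{i}", "values": [0.1] * 1536} for i in range(10_000)]  # sample data

    # pool_threads controls how many requests are in flight at once.
    with pc.Index("index-name", pool_threads=10) as index:
        # async_req=True returns a future per batch instead of blocking on each call.
        async_results = [
            index.upsert(vectors=chunk, namespace="ns", async_req=True)
            for chunk in chunked(vectors)
        ]
        # Wait for every batch to finish and surface any errors.
        [result.get() for result in async_results]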
Create backup for serverless index: pc.create_backup(index_name='my-index', backup_name='backup-2025-11-15'). Backups are static copies, stored in same region as source index. Limitations: only for serverless indexes (not pods), max 2,000 namespaces, 50 backups per project quota, only includes vectors inserted 15+ minutes prior (recent vectors excluded). Restore from backup: pc.create_index(name='restored-index', backup_source='backup-2025-11-15', spec=ServerlessSpec(cloud='aws', region='us-east-1')). Restored index same region as backup. Use cases: protect against accidental deletes, system failures, rollback to known-good state. Pod-based alternative: create collection (static copy) instead of backup. Third-party: HYCU offers automated backup for Pinecone with namespace-level recovery. Production: schedule daily backups, test restore process, document backup retention policy. Backups != replication (backups static, replication live).
Two-stage retrieval: (1) vector search retrieves 100 candidates, (2) reranker scores top-10. Use Pinecone Inference API: from pinecone import Pinecone; pc = Pinecone(api_key='key'); results = pc.inference.rerank(model='bge-reranker-v2-m3', query='search query', documents=[{'id': '1', 'text': 'doc1'}, {'id': '2', 'text': 'doc2'}], top_n=10, return_documents=True). Models available: bge-reranker-v2-m3 (default), pinecone-rerank-v0, cohere-rerank-v3.5. Reranking processes query-document pairs, outputs similarity scores. Workflow: vector_results = index.query(vector=vec, top_k=100); docs = [{'id': r.id, 'text': r.metadata['text']} for r in vector_results]; reranked = pc.inference.rerank(model='bge-reranker-v2-m3', query=query, documents=docs, top_n=10). Cost: $0.002 per request (bge-reranker-v2-m3). Benefits: 20-30% accuracy improvement over vector-only. Production: rerank top-50 to top-100 candidates, cache rerank results for repeated queries. LangChain integration: from langchain_pinecone.rerank import PineconeRerank.
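A two-stage retrieval sketch with the Inference API's rerank endpoint, assuming each match stores its source text under a 'text' metadata field; the index name, query, and response field names follow recent SDK versions and should be checked against yours:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("docs-index")                        # hypothetical index name

    query = "how do I rotate API keys?"
    query_vec = [0.0] * 1536                              # placeholder; embed the query for real use

    # Stage 1: broad vector recall.
    candidates = index.query(vector=query_vec, top_k=100, include_metadata=True)
    docs = [{"id": m.id, "text": m.metadata["text"]} for m in candidates.matches]

    # Stage 2: rerank the candidates and keep the top 10.
    reranked = pc.inference.rerank(
        model="bge-reranker-v2-m3",
        query=query,
        documents=docs,
        top_n=10,
        return_documents=True,
    )
    for row in reranked.data:
        print(row.index, row.score)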
Alpha (α) controls weight between dense (semantic) and sparse (keyword) search. Range: 0.0 to 1.0. α=1.0: pure dense/semantic search (ignores sparse), α=0.0: pure sparse/keyword search (ignores dense), α=0.5: balanced (50% each). Formula: alpha * dense_vec + (1-alpha) * sparse_vec. Pinecone does NOT expose alpha parameter directly in API - you must scale vector values yourself before upserting/querying. Implementation: scale sparse values by (1-alpha) and dense values by alpha before combining. Tuning: start α=0.5, test on validation set, optimize for F1 score. Use cases: α=0.7-0.9 for semantic-heavy (QA, chatbots), α=0.3-0.5 for keyword-heavy (exact match, codes). Dynamic alpha: adjust per query type (questions → higher alpha, keywords → lower alpha). Production: A/B test different alphas, monitor click-through rate. IMPORTANT: Only indexes with dotproduct metric support sparse-dense vectors (hybrid search in public preview as of 2025).
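A sketch of the client-side alpha scaling described above (the API itself takes no alpha parameter); the function name is illustrative and the inputs are assumed to come from your embedding model and BM25 encoder:

    def weight_by_alpha(dense, sparse, alpha):
        """Scale dense values by alpha and sparse values by (1 - alpha); alpha in [0, 1]."""
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must be between 0 and 1")
        scaled_dense = [v * alpha for v in dense]
        scaled_sparse = {
            "indices": sparse["indices"],
            "values": [v * (1 - alpha) for v in sparse["values"]],
        }
        return scaled_dense, scaled_sparse

    # Usage (placeholders): dense_vec from your embedding model, sparse_vec from BM25Encoder.
    # hdense, hsparse = weight_by_alpha(dense_vec, sparse_vec, alpha=0.7)
    # index.query(vector=hdense, sparse_vector=hsparse, top_k=10)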
Disk-based metadata filtering (2025 feature) stores metadata on disk instead of RAM - reduces memory footprint while maintaining query performance. Enabled automatically for new serverless indexes - no configuration needed. Technical implementation: uses bitmap indices (similar to data warehouses), low-cardinality bitmaps cached in memory (within budget), high-cardinality bitmaps streamed from disk and intersected with vector index. Architecture: immutable vector slabs in LSM-tree structure with metadata index per slab. Benefits: high-cardinality filters (millions of unique values), improved recall vs in-memory, lower cost. Use cases: user_id with millions of users, product_id with large catalogs, timestamp with granular precision. Performance: disk-based filtering as fast as in-memory for most queries (metadata stored on SSDs). Supports all filter operators: $eq, $ne, $gt, $gte, $lt, $lte, $in. Example: filter={'product_id': {'$in': list_of_1M_product_ids}} works efficiently. Production: use metadata filtering for high-cardinality partitioning (millions of partitions).
Performance identical - choose based on use case. Use namespaces: (1) Strict data isolation (tenant A cannot see tenant B data even with bugs), (2) <10K tenants, (3) Different vector sets per tenant, (4) Deletion by tenant (delete_all in namespace). Use metadata filtering: (1) >10K tenants (Pinecone supports millions of namespaces but metadata scales better), (2) Cross-tenant queries needed, (3) High-cardinality filtering (millions of users), (4) Flexible multi-dimensional filtering (tenant + date + category). Query namespaces: index.query(vector=vec, namespace='tenant123'). Query metadata: index.query(vector=vec, filter={'tenant_id': {'$eq': 'tenant123'}}). Combine both: namespace for primary partition (region/env), metadata for secondary (user/date). Future: delete by metadata (metadata filtering enables bulk deletes like namespaces). Production decision: <1K tenants → namespaces, >10K tenants → metadata, hybrid needs → both. Disk-based metadata filtering (2025) makes metadata approach scalable to millions.
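A small sketch combining both isolation levels: a namespace for the coarse partition and a metadata filter for attributes inside it; all names, the namespace, and the filter fields are illustrative:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("multitenant-index")                 # hypothetical index name

    query_vec = [0.0] * 1536                              # placeholder embedding

    # Namespace isolates the region/environment; metadata narrows to one tenant and date range.
    results = index.query(
        vector=query_vec,
        namespace="eu-prod",
        filter={"tenant_id": {"$eq": "tenant123"}, "date": {"$gte": 20250101}},
        top_k=10,
        include_metadata=True,
    )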
Provisioned read capacity (early access, 2025) provides dedicated storage and compute resources for predictable performance with millions to billions of records and moderate-to-high QPS (1000+ queries/sec). Configuration uses API version 2025-10: set mode='Dedicated' in the spec.serverless.read_capacity object, choose a node type (b1 or t1), and configure replicas based on throughput needs. Each shard provides 250GB of storage. Request access: contact Pinecone sales for early access. Benefits: (1) Dedicated storage cached in memory+disk for low-latency queries, (2) No rate limits on read operations (query, list, fetch), (3) Predictable cost for reserved capacity, (4) Isolated resources (no noisy neighbors). When to use: (1) >1,000 QPS sustained, (2) <10ms p99 latency SLA required, (3) Production-critical workload, (4) Budget for reserved capacity. Default serverless: auto-scaling handles most traffic (variable workloads, burst traffic, cost optimization). Production: start with auto-scaling, upgrade to provisioned capacity when auto-scaling is insufficient.
Pinecone Console provides usage monitoring: navigate to Usage dashboard for read units, write units, storage (GB), costs by index. Available to organization owners on Standard/Enterprise plans. Metrics breakdown: total requests, p50/p95/p99 latency, error rate, throttled requests. API-level tracking: query/fetch/list requests return usage parameter with read unit consumption; hosted embedding model requests return usage parameter with total tokens. Export: download CSV for billing analysis. Cost structure (serverless): pay per read unit (1 query = 1 RU), write unit (1 upsert = 1 WU), storage (GB-hour). Optimization: (1) Use serverless (50x cheaper than pods), (2) Reduce dimensionality (768 vs 1536 = 50% storage savings), (3) Clean unused indexes/namespaces, (4) Batch upserts (fewer WUs), (5) Cache query results (reduce RUs). Third-party: Datadog integration available for tracking requests, latency, usage trends. Production: tag indexes by environment (dev/staging/prod), review monthly usage, set budget alerts.
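A sketch of reading per-request read-unit usage from a serverless query response; the usage field and its read_units attribute are assumed from the serverless response format, and the index name and vector are placeholders:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("prod-index")                        # hypothetical index name

    results = index.query(vector=[0.0] * 1536, top_k=10, include_metadata=True)

    # Serverless responses report how many read units the request consumed.
    if results.usage is not None:
        print("read units:", results.usage.read_units)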
API key security: (1) Never commit keys to git (use .env files, gitignore), (2) Store in secrets manager (AWS Secrets Manager, Vercel env vars), (3) Rotate keys periodically (generate new key, migrate, delete old), (4) Use environment-specific keys (dev/staging/prod separate). Access control: Pinecone API keys have index-level permissions. Create restricted keys: Pinecone Console → API Keys → Create key → select specific indexes. Key types: read-write (full access), read-only (query only, no upserts/deletes). Client initialization: pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY')). Production best practices: (1) TLS/HTTPS only (Pinecone enforces), (2) Network isolation (VPC peering for Enterprise), (3) Audit logging (track API key usage), (4) Principle of least privilege (read-only keys for apps only querying). Compliance: SOC 2 Type II, GDPR compliant. Enterprise features: SSO, RBAC, private endpoints. Monitor: failed auth attempts (403 errors), unusual usage patterns.
Deploy indexes in multiple regions for global coverage. Pinecone serverless available in: AWS (us-west-2, us-east-1, eu-west-1), GCP (us-central1, europe-west4), Azure (eastus, westeurope). Create separate index per region: pc.create_index(name='index-us', dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-east-1')); pc.create_index(name='index-eu', dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='eu-west-1')). Global control plane: requests to api.pinecone.io auto-route to nearest API server via Google Cloud global load balancer backed by Cloud Spanner (globally replicated). Client-side routing: use geo-DNS (Route 53, Cloudflare) or application logic to detect user region (IP geolocation), query regional index. Sync strategy: write to all regions (eventual consistency) or write to primary + replicate. Latency: p50 <10ms, p99 <50ms globally. Benefits: <50ms latency worldwide, regulatory compliance (data residency). Production: deploy app in same region as index for optimal performance.
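A sketch of simple application-side routing to the nearest regional index, assuming the user's region is already resolved (for example from IP geolocation); the region keys and index names are illustrative:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")

    # Map coarse user regions to the regional index serving them.
    REGIONAL_INDEXES = {
        "us": "index-us",          # e.g. aws us-east-1
        "eu": "index-eu",          # e.g. aws eu-west-1
    }

    def query_nearest(user_region, query_vec, top_k=10):
        index_name = REGIONAL_INDEXES.get(user_region, "index-us")   # fallback region
        index = pc.Index(index_name)
        return index.query(vector=query_vec, top_k=top_k, include_metadata=True)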
Migration steps: (1) Create collection from pod index: pc.create_collection(name='migration-backup', source='pod-index-name'), (2) Create serverless index from collection: pc.create_index(name='serverless-index', dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-east-1'), collection_source='migration-backup'), (3) Validate: query both indexes, compare results (should match), (4) Cutover: update application to use serverless index, (5) Clean up: delete pod index after 7-day safety period. Limitations: same metric (cosine/euclidean/dotproduct) required, same dimension, collection max 2,000 namespaces. Test migration: use non-production index first. Dual-write approach (zero downtime): write to both pod + serverless during transition, switch reads gradually. Benefits after migration: 50x cost reduction, auto-scaling, lower latency. Gotchas: API changes (collection vs backup terminology), region compatibility (ensure serverless region matches pod region). Production: schedule migration during low-traffic window, monitor query performance post-migration.
Keep top_k small for best performance: top_k=10-50 recommended; top_k>100 is noticeably slower. Query: index.query(vector=vec, top_k=10, include_metadata=True, include_values=False). include_metadata=True returns metadata (default), include_metadata=False omits metadata (faster, smaller response). include_values=True returns vector values (large payload), include_values=False omits vectors (recommended unless needed). Performance impact: top_k=100 can be roughly 2x slower than top_k=10, and large metadata (>10KB per vector) slows queries further. Optimization: (1) Request only needed metadata fields (future feature), (2) Use top_k=10-20 for production, (3) Fetch additional results with pagination rather than a large top_k, (4) Disable include_values unless required, (5) Filter before retrieval (reduces candidates), (6) Use sparse indexes for metadata-only queries. Production: monitor p95 latency, tune top_k per use case (chatbots: top_k=5, analytics: top_k=50). Serverless auto-optimizes query performance.
Rate limits vary by plan: Free (5 API calls/sec), Starter (100/sec), Standard (1000/sec), Enterprise (custom). Status code 429 means the rate limit was exceeded. Retry with exponential backoff: catch the SDK exception, check for status 429, and wait 2**attempt seconds before retrying (see the sketch below for a runnable version). Production retry library: tenacity: @retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(5), retry=retry_if_exception_type(PineconeException)). Prevent rate limits: (1) Use pool_threads for parallel requests (distributes load), (2) Batch operations (upsert 1000 vectors per call instead of 1000 individual upserts), (3) Monitor usage (stay below the limit), (4) Upgrade plan for higher limits. Serverless advantage: reads and writes are isolated - write batches don't affect query latency. Production: implement a circuit breaker, monitor 429 errors, alert on threshold.
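A runnable version of the backoff loop above, assuming the SDK's PineconeApiException exposes the HTTP status; the index name, vector, and wait times are illustrative:

    import random
    import time

    from pinecone import Pinecone
    from pinecone.exceptions import PineconeApiException

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("index-name")                        # hypothetical index name

    def query_with_backoff(vector, top_k=10, max_retries=5):
        for attempt in range(max_retries):
            try:
                return index.query(vector=vector, top_k=top_k, include_metadata=True)
            except PineconeApiException as exc:
                if exc.status != 429 or attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter: 1s, 2s, 4s, 8s ... plus noise.
                time.sleep(2 ** attempt + random.random())

    result = query_with_backoff([0.0] * 1536)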
Versioning strategies: (1) Named indexes: create index-v1, index-v2 - switch application between versions (blue-green deployment). (2) Namespaces: use namespace as version (v1, v2, v3 namespaces in same index). (3) Metadata versioning: add version field to metadata, filter by version. Recommended: named indexes for major changes (embedding model upgrades), namespaces for minor versions. Implementation: pc.create_index(name='embeddings-v2', dimension=1536, spec=ServerlessSpec(...)). Dual-index approach: write to both v1 + v2 during transition, switch reads after validation, delete v1 after safety period. Rollback: revert application to use index-v1, fix issues, redeploy to v2. Namespace versioning: index.upsert(vectors, namespace='v2'); index.query(vector=vec, namespace='v2'). Metadata versioning: upsert with version field, query with filter={'version': 'v2'}. Production: version embeddings when model changes (text-embedding-3-small → 3-large), use backups for point-in-time recovery (serverless only), document version changelog.
Pinecone's list API supports pagination with a pagination_token for serverless indexes only. Use listPaginated() to retrieve vector IDs: results = await index.listPaginated({namespace: 'docs', limit: 100, prefix: 'doc1#'}). Returns up to 100 IDs by default (configurable with the limit parameter). The response includes a pagination_token when more IDs exist; pass the token to get the next batch. When no pagination_token is in the response, all IDs have been retrieved. Python SDK: auto-paginates with list(), or paginate manually with list_paginated(). Other SDKs (Node.js, Java, Go, .NET) + REST API: manual pagination required. For query results there is no built-in pagination - workarounds: (1) Retrieve a larger result set once (e.g. top_k=100) and paginate client-side, (2) ID-based exclusion: store the ID in metadata and filter={'id': {'$nin': seen_ids}} (filters only see metadata fields, not the vector ID itself), (3) Metadata cursor: filter={'created_at': {'$gt': last_seen_timestamp}}, order client-side. Best practice: cache large result sets (Redis), limit top_k to a reasonable value (100-500), use metadata filtering to narrow scope before pagination.
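A Python sketch of manual ID pagination with list_paginated on a serverless index; the prefix, namespace, and index name are placeholders, and the response shape (vectors list plus pagination.next token) follows the Python SDK and may differ slightly across versions:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("docs-index")                        # hypothetical index name

    all_ids = []
    token = None
    while True:
        page = index.list_paginated(
            prefix="doc1#", namespace="docs", limit=100, pagination_token=token
        )
        all_ids.extend(v.id for v in page.vectors)
        token = page.pagination.next if page.pagination else None
        if token is None:
            break          # no pagination token means every ID has been retrieved

    print(f"Found {len(all_ids)} IDs")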