OpenAI API provides access to AI models via REST API. Models (2025): GPT-4o (multimodal flagship, $2.50/1M input, $10/1M output tokens, 128k context), GPT-4o-mini (cost-effective, $0.15/1M input), o1 (advanced reasoning, multimodal, 83% AIME math), o1-mini (STEM-focused, 80% cheaper), GPT-4 Turbo (legacy, 128k context), text-embedding-3-small/large (semantic search, $0.02/$0.13 per 1M tokens), DALL-E 3 (image generation), Whisper v3 (speech-to-text, 102 languages), TTS (text-to-speech). Use official SDKs: pip install openai (Python SDK v1.x) or npm install openai (Node.js). Authenticate via API keys in headers: Authorization: Bearer sk-.... Essential for production AI applications.
OpenAI FAQ & Answers
Chat Completions API generates text via POST /v1/chat/completions. Python example: client.chat.completions.create(model='gpt-4o', messages=[{'role': 'system', 'content': 'You are a helpful assistant'}, {'role': 'user', 'content': 'Hello'}]). Key parameters: model (gpt-4o/gpt-4o-mini/o1), temperature (0-2 randomness, 0=deterministic), max_tokens (output limit, up to 16,384 for GPT-4o), top_p (nucleus sampling), response_format (for JSON mode/structured outputs with strict: true). Response: choices[0].message.content. Streaming: stream=True for incremental SSE responses. Supports prompt caching (auto for 1024+ tokens, 50% cost reduction). 128k context window for GPT-4o. Essential for production chatbots.
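A minimal runnable version of that call, assuming the openai v1 SDK and OPENAI_API_KEY set in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello"},
    ],
    temperature=0.7,
    max_tokens=256,  # cap output tokens to control cost
)
print(response.choices[0].message.content)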
Tokens are text chunks (words/subwords) that models process. English: ~4 chars per token, ~0.75 words per token. Both input and output tokens are billed separately. Token limits (2025): GPT-4o (128k context, 16k max output), o1 (128k context), GPT-4o-mini (128k context). Count tokens: pip install tiktoken, then encoding = tiktoken.encoding_for_model('gpt-4o'); tokens = encoding.encode(text); count = len(tokens). Pricing: GPT-4o input $2.50/1M, output $10/1M (4x more expensive). Optimize costs: (1) concise prompts, (2) limit max_tokens, (3) use prompt caching for repeated prefixes (50% discount), (4) choose GPT-4o-mini for simple tasks. Monitor via usage dashboard.
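A small counting helper along those lines (assumes tiktoken is installed; o200k_base is the encoding GPT-4o uses, as a fallback for model names tiktoken doesn't recognize):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    # Resolve the tokenizer used by the model, falling back for unknown names
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("o200k_base")
    return len(encoding.encode(text))

print(count_tokens("Hello, world!"))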
Function calling (now tools parameter, functions deprecated) enables models to call your functions with structured JSON arguments. Python example: tools=[{'type': 'function', 'function': {'name': 'get_weather', 'description': 'Get weather', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string'}}, 'required': ['location']}, 'strict': True}}]. Set strict=True for guaranteed schema adherence (100% accuracy vs <40% without). Model returns tool_calls array with id, name, arguments. Execute function, send result back: {'role': 'tool', 'tool_call_id': id, 'content': result}. Parallel tool calling supported (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo). o1/o1-mini don't support parallel calls. Use for: database queries, API integration, calculations.
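A sketch of the full round trip, with a stubbed get_weather standing in for a real weather API:

import json
from openai import OpenAI

client = OpenAI()

def get_weather(location: str) -> str:
    return f"Sunny, 22C in {location}"  # stub; a real app would call a weather service

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
            "additionalProperties": False,  # required by strict mode
        },
        "strict": True,
    },
}]

messages = [{"role": "user", "content": "Weather in Paris?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Append the assistant turn, then the tool result, and ask for the final answer
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": get_weather(**args)})
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)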
Embeddings convert text into vector representations capturing semantic meaning. Models (2025): text-embedding-3-small ($0.02/1M tokens, 1,536 dims, best value), text-embedding-3-large ($0.13/1M tokens, 3,072 dims, highest quality). Python example: response = client.embeddings.create(model='text-embedding-3-small', input='Your text here'); vector = response.data[0].embedding. Max input: 8,192 tokens per call, batch up to 2,048 inputs. Output dimensions configurable via the dimensions parameter (encoding_format controls float vs base64 encoding). Use for: semantic search (RAG), clustering, recommendations, duplicate detection. Measure similarity: cosine_similarity(vec1, vec2). Store in: Pinecone, Weaviate, Qdrant, pgvector. Only input tokens billed (no output tokens).
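For example, embedding two sentences and comparing them (numpy used here for the cosine computation; the sentences are illustrative):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = embed(["The cat sat on the mat", "A feline rested on a rug"])
print(cosine_similarity(vecs[0], vecs[1]))  # high score: semantically similar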
Message roles structure conversations. System: sets AI behavior/personality (e.g., 'You are a helpful coding assistant'). Processed first, influences all responses. Max 256k chars. User: user input, questions. Assistant: model's previous responses (for conversation history). Python example: messages=[{'role': 'system', 'content': 'Be concise'}, {'role': 'user', 'content': 'What is Python?'}, {'role': 'assistant', 'content': 'Programming language'}, {'role': 'user', 'content': 'Tell me more'}]. Best practices: (1) clear system message for consistent behavior, (2) include conversation history for context, (3) system message optional but recommended. Prompt caching benefits: system messages reused across calls get 50% discount on cached tokens.
Streaming returns incremental responses via Server-Sent Events (SSE) for lower latency. Python example: stream = client.chat.completions.create(model='gpt-4o', messages=[...], stream=True); for chunk in stream: if chunk.choices[0].delta.content: print(chunk.choices[0].delta.content, end=''). Response: series of chunks with delta.content. Final chunk has finish_reason='stop'. Benefits: (1) progressive UI updates, (2) perceived latency reduction, (3) better UX for long responses. TypeScript: use async iterators. Works with Chat Completions API. Use for: chatbots, real-time apps, streaming interfaces. SDKs handle SSE parsing automatically. Essential for responsive production applications.
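The inline example expanded into a runnable sketch:

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about the sea"}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g. the final one) carry no content delta
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
print()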
Temperature (0-2, default 1) controls randomness in token selection. 0: near-deterministic, strongly favors the highest-probability token (use for factual Q&A, code generation, structured data). 0.7: balanced creativity and coherence (content generation). 1.5-2: highly creative, diverse outputs (brainstorming, creative writing). Python example: client.chat.completions.create(model='gpt-4o', temperature=0, seed=123, messages=[...]). Set seed parameter with temperature=0 for more reproducible outputs across calls (determinism is best-effort, not guaranteed). Alternative: top_p (nucleus sampling, 0-1, typically 0.1 or 0.9). Don't modify both temperature and top_p simultaneously. Use temperature for most use cases. Critical for controlling output determinism vs creativity.
RAG combines retrieval with generation for accurate, up-to-date responses. Implementation: (1) Embed knowledge base: embeddings = client.embeddings.create(model='text-embedding-3-small', input=docs), store in vector DB (Pinecone, Qdrant, pgvector, Weaviate). (2) Query: embed user question, retrieve top-k similar docs via cosine similarity. (3) Augment prompt: context = '\n'.join(retrieved_docs); messages=[{'role': 'system', 'content': f'Answer using: {context}'}, {'role': 'user', 'content': question}]. (4) Generate: client.chat.completions.create(model='gpt-4o', messages=messages). Benefits: reduces hallucinations, provides source citations, handles dynamic data. Use text-embedding-3-small ($0.02/1M) for cost efficiency. Essential for Q&A, documentation search, customer support.
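A minimal in-memory sketch of those four steps (a real system would persist vectors in one of the databases above; the two docs and the question are illustrative):

import numpy as np
from openai import OpenAI

client = OpenAI()
docs = ["Refunds are processed within 5 business days.",
        "Support is available 9am-5pm EST on weekdays."]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)  # step 1: precompute once, store in a vector DB in production

def answer(question: str, top_k: int = 1) -> str:
    q_vec = embed([question])[0]  # step 2: embed the query
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(docs[i] for i in np.argsort(sims)[::-1][:top_k])  # step 3
    resp = client.chat.completions.create(  # step 4: generate with context
        model="gpt-4o",
        messages=[{"role": "system", "content": f"Answer using only this context:\n{context}"},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))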
Prompt engineering best practices (2025): (1) Be specific: 'Write Python function to validate email using regex' not 'Write code'. (2) Few-shot learning: provide 2-3 examples of desired output. (3) Use delimiters: triple quotes for user input, XML tags for sections. (4) Specify format: 'Return JSON with keys: name, age, city'. (5) Chain-of-thought: 'Let's think step by step' for reasoning tasks (esp. o1 models). (6) System messages: set behavior/constraints. (7) Use structured outputs: response_format={'type': 'json_schema', 'json_schema': schema, 'strict': True} for guaranteed schema. (8) Test temperatures: 0 for factual, 0.7 for creative. Avoid: vague instructions, assuming knowledge. Use o1 for complex reasoning. Essential for production quality.
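For tip (2), a few-shot message list that pins down an exact output format might look like this (the task and examples are illustrative):

few_shot_messages = [
    {"role": "system", "content": "Extract the city from the sentence. Reply with the city name only."},
    # Two worked examples teach the exact output format
    {"role": "user", "content": "I flew into Tokyo last night."},
    {"role": "assistant", "content": "Tokyo"},
    {"role": "user", "content": "We drove through Lyon on the way south."},
    {"role": "assistant", "content": "Lyon"},
    # The actual query follows the same pattern
    {"role": "user", "content": "The conference is in Berlin this year."},
]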
Whisper API (v3, 2025) transcribes/translates audio with 12.4% avg error rate across 102 languages. Endpoints: POST /v1/audio/transcriptions (same language) or /v1/audio/translations (to English). Python example: audio_file = open('audio.mp3', 'rb'); transcript = client.audio.transcriptions.create(model='whisper-1', file=audio_file, response_format='json', language='en'). Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm (max 25MB). Parameters: model (whisper-1), language (ISO-639-1), prompt (context for accuracy), response_format (json/text/srt/vtt/verbose_json with timestamps). Use for: meeting transcriptions, voice interfaces, accessibility, content creation. The open-source Whisper model is MIT-licensed; the hosted API is a paid service. Turbo model available for faster processing.
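A sketch of a transcription call with timestamps (meeting.mp3 and the prompt text are placeholders):

from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:  # must be under 25MB
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="en",
        response_format="verbose_json",  # includes segment timestamps
        prompt="Acme Corp quarterly planning meeting",  # domain hints improve accuracy
    )
print(transcript.text)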
Rate limits (2025): RPM (requests/min), TPM (tokens/min), RPD (requests/day), TPD (tokens/day), IPM (images/min). Tier system: Tier 1 ($5 spent, GPT-4o: 500k TPM), Tier 2 ($50, 1M TPM), Tier 3 ($100, 2M TPM), Tier 4 ($250, 4M TPM), Tier 5 ($1000, 40M TPM). Auto-upgrade based on spend. Response headers: x-ratelimit-limit-tokens, x-ratelimit-remaining-tokens. Error: 429 status code. Handling: exponential backoff wait = min(60, 2^retry + random(0,1)), max 5 retries. Python: use tenacity library. Optimization: (1) batch requests, (2) prompt caching, (3) use GPT-4o-mini for high volume, (4) monitor usage dashboard. Production essential: implement retry logic with jitter.
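A hand-rolled version of that backoff policy (RateLimitError is the SDK's exception for HTTP 429):

import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def create_with_backoff(max_retries: int = 5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            # Exponential backoff with jitter, capped at 60 seconds
            time.sleep(min(60, 2 ** attempt + random.uniform(0, 1)))
    raise RuntimeError("still rate limited after retries")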
GPT-4o (multimodal flagship, 2025) and GPT-4 Turbo with vision analyze images. Python example: messages=[{'role': 'user', 'content': [{'type': 'text', 'text': 'What is in this image?'}, {'type': 'image_url', 'image_url': {'url': 'https://example.com/image.jpg', 'detail': 'high'}}]}]; response = client.chat.completions.create(model='gpt-4o', messages=messages). Image formats: URL or base64 data URI. Detail levels: 'low' (512px, faster, cheaper), 'high' (2048px, detailed analysis). Supports multiple images per message. Use for: OCR, chart analysis, visual Q&A, image description, diagram understanding. Max 20MB per image. Cost: GPT-4o $2.50-$10/1M tokens + image tokens. Essential for multimodal AI applications.
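A base64 data-URI variant for local files (chart.png is a placeholder):

import base64
from openai import OpenAI

client = OpenAI()
with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this chart"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "low"}},  # 'low' is faster and cheaper
        ],
    }],
)
print(response.choices[0].message.content)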
Structured outputs (2025) guarantee 100% schema adherence vs JSON mode which only ensures valid JSON. Use structured outputs with strict=True: response_format={'type': 'json_schema', 'json_schema': {'name': 'response', 'schema': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'age': {'type': 'number'}}, 'required': ['name', 'age'], 'additionalProperties': False}, 'strict': True}}. gpt-4o-2024-08-06 scores 100% on schema compliance vs <40% without strict mode. Also works with function calling: set strict: true in tools. JSON mode (legacy): response_format={'type': 'json_object'}, must mention 'JSON' in prompt. Use structured outputs for: API integration, data extraction, type-safe parsing. No manual validation needed. Production essential.
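Putting that schema into a runnable call (the example sentence is illustrative):

import json
from openai import OpenAI

client = OpenAI()
schema = {
    "name": "person",
    "schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
        "required": ["name", "age"],
        "additionalProperties": False,
    },
    "strict": True,
}
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Ada Lovelace was 36 when she died."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
person = json.loads(response.choices[0].message.content)  # guaranteed to match the schema
print(person["name"], person["age"])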
Fine-tuning (2025) customizes GPT-4o, GPT-4o-mini, GPT-3.5 Turbo on your data. Process: (1) Prepare JSONL: {'messages': [{'role': 'system', 'content': '...'}, {'role': 'user', 'content': '...'}, {'role': 'assistant', 'content': '...'}]}, 50+ examples minimum, 100-500+ ideal. (2) Upload: client.files.create(file=open('data.jsonl'), purpose='fine-tune'). (3) Create job: client.fine_tuning.jobs.create(training_file=file_id, model='gpt-4o-2024-08-06'). (4) Use: client.chat.completions.create(model='ft:gpt-4o-...'). Cost: GPT-4o training $25/1M tokens, hosted inference $3.75/1M input. Use for: consistent formatting, domain-specific tone, complex instructions. Alternative: try prompt engineering + few-shot first (often sufficient).
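The lifecycle in code (data.jsonl is your prepared training file; status checking is simplified, a production job would poll or use webhooks):

from openai import OpenAI

client = OpenAI()

# 1. Upload the training file (JSONL of chat-formatted examples)
training_file = client.files.create(file=open("data.jsonl", "rb"), purpose="fine-tune")

# 2. Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)

# 3. Check status; on success, job.fine_tuned_model holds the ft:gpt-4o-... name
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status)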
Moderation API (2025) detects harmful content before/after generation. Python example: response = client.moderations.create(model='omni-moderation-latest', input='text to check'); flagged = response.results[0].flagged. Categories: hate, harassment, self-harm, sexual, sexual/minors, violence, violence/graphic. Returns: flagged (bool), category_scores (0-1), categories (dict). Best practices: (1) moderate all user inputs before API call, (2) filter model outputs before display, (3) implement user reporting, (4) log violations for review, (5) follow OpenAI usage policies. Free endpoint. Safety layers: input validation → API call → output moderation → monitoring. Essential for user-facing applications to prevent policy violations and ensure safe UX.
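A simple input-side gate built on that endpoint (the refusal message is illustrative):

from openai import OpenAI

client = OpenAI()

def safe_complete(user_input: str) -> str:
    # Moderate the user input before spending tokens on generation
    check = client.moderations.create(model="omni-moderation-latest", input=user_input)
    if check.results[0].flagged:
        return "Sorry, I can't help with that."  # hypothetical refusal message
    response = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": user_input}]
    )
    return response.choices[0].message.content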
API key security (2025) best practices: (1) Never commit to git - add .env to .gitignore. (2) Use environment variables: export OPENAI_API_KEY='sk-...', Python: os.getenv('OPENAI_API_KEY'). (3) Backend only - NEVER expose in frontend JavaScript. (4) Use secret managers: AWS Secrets Manager, Azure Key Vault, HashiCorp Vault. (5) Scope keys: create project-specific keys at platform.openai.com. (6) Rotate regularly: 90-day rotation policy. (7) Monitor usage: set spending limits, alert on anomalies. (8) Separate dev/prod keys. (9) Leaked key: revoke immediately via dashboard. (10) Use service accounts for production. Check usage per key in dashboard. Essential for preventing unauthorized access and cost overruns.
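The corresponding client setup for point (2):

import os
from openai import OpenAI

# The SDK reads OPENAI_API_KEY automatically; passing it explicitly also works.
# Never hardcode the key in source or ship it to a browser.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))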
Model comparison (2025): GPT-4o (flagship, $2.50/1M input, $10/1M output, 128k context, 16k max output, multimodal text+vision, fastest GPT-4 intelligence, prompt caching, structured outputs) - use for most production workloads. GPT-4o-mini ($0.15/0.60 per 1M tokens, 128k context, text+vision, roughly 94% cheaper than GPT-4o) - use for high-volume simple tasks. GPT-4 Turbo (legacy, $10/30 per 1M, 128k context, vision) - superseded by GPT-4o. GPT-3.5 Turbo (legacy, $0.50/1.50 per 1M, 16k context) - use only for basic completions. o1/o1-mini (reasoning models, different pricing) - use for complex math/coding. Choose GPT-4o for 90% of use cases (best price/performance). Use GPT-4o-mini for cost optimization.
Conversation management: include all messages in messages array. Python pattern: conversation = [{'role': 'system', 'content': 'Be helpful'}]; conversation.append({'role': 'user', 'content': question}); response = client.chat.completions.create(model='gpt-4o', messages=conversation); conversation.append({'role': 'assistant', 'content': response.choices[0].message.content}). Challenges: 128k token limit, costs ($2.50/1M input). Strategies: (1) Sliding window: keep last 10 messages. (2) Summarization: summary = summarize(old_messages); messages = [system, summary] + recent_messages. (3) Important message retention: pin key context. (4) Prompt caching: system message cached (50% discount). Calculate tokens: use tiktoken. Clear context when topic changes. Use Assistants API for managed threads.
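A minimal sliding-window helper along the lines of strategy (1):

def trim_history(conversation: list[dict], max_messages: int = 10) -> list[dict]:
    # Keep the system message plus only the most recent turns
    system = [m for m in conversation if m["role"] == "system"]
    rest = [m for m in conversation if m["role"] != "system"]
    return system + rest[-max_messages:]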
Logprobs (log probabilities, 2025) reveal model confidence per token. Python example: response = client.chat.completions.create(model='gpt-4o', messages=[...], logprobs=True, top_logprobs=3); for choice in response.choices: for token_data in choice.logprobs.content: print(f'{token_data.token}: {token_data.logprob}'). Parameters: logprobs=True enables, top_logprobs=N (1-20) returns N most likely alternatives per token. Response: token, logprob (e.g., -0.1 = very confident, -5.0 = unlikely), top_logprobs list. Use for: (1) confidence scoring (flag low-confidence outputs), (2) uncertainty quantification, (3) debugging hallucinations, (4) A/B testing prompts. Sequence probability: sum logprobs. Essential for production quality monitoring.
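One way to turn logprobs into a rough confidence score (the 0.8 threshold is an arbitrary illustration, tune per application):

import math
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of Australia?"}],
    logprobs=True,
    top_logprobs=3,
)
tokens = response.choices[0].logprobs.content
# Average per-token probability (exp of logprob) as a crude confidence measure
avg_prob = sum(math.exp(t.logprob) for t in tokens) / len(tokens)
if avg_prob < 0.8:
    print("Low-confidence answer, consider flagging for review")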
Assistants API v2 (2025, beta) provides stateful conversations with managed tools. Python example: assistant = client.beta.assistants.create(model='gpt-4o', instructions='You are helpful', tools=[{'type': 'file_search'}]); thread = client.beta.threads.create(); message = client.beta.threads.messages.create(thread_id=thread.id, role='user', content='Question'); run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id). Built-in tools: file_search (10k files, $0.10/GB/day storage), code_interpreter (Python execution, $0.03/session), function_calling. Features: persistent threads, streaming, JSON mode, 256k char system instructions. Benefits: no manual context management, built-in RAG, stateful sessions. Use for: customer support, data analysis, complex workflows. Easier than managing conversation history manually.
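A sketch completing that flow by polling the run and reading the reply (production code might prefer the SDK's streaming or polling helpers):

import time
from openai import OpenAI

client = OpenAI()
assistant = client.beta.assistants.create(model="gpt-4o", instructions="You are helpful")
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content="Question")

run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):
    time.sleep(1)  # simple polling loop
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message first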
Cost optimization (2025): (1) Model selection: GPT-4o-mini ($0.15/0.60 per 1M) for simple tasks, GPT-4o ($2.50/10) for complex. (2) Prompt caching: reuse system messages (auto 50% discount on cached tokens 1024+). (3) Reduce tokens: concise prompts, limit max_tokens to minimum needed. (4) Batch API: 50% discount for non-urgent tasks. (5) Embeddings: text-embedding-3-small ($0.02/1M) not large ($0.13/1M). (6) Cache responses: Redis/Memcached for frequent queries. (7) Monitor: set spending limits, track per-user costs. (8) Smart routing: simple queries → GPT-4o-mini, complex → GPT-4o. (9) Output tokens cost 4x input (GPT-4o), minimize unnecessary generation. (10) Use structured outputs to avoid retries. Track: usage dashboard, alert on anomalies.
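A toy router illustrating point (8) (the length heuristic is a stand-in for a real classifier):

from openai import OpenAI

client = OpenAI()

def route_model(prompt: str) -> str:
    # Naive heuristic: long prompts go to the stronger model
    return "gpt-4o" if len(prompt) > 500 else "gpt-4o-mini"

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model=route_model(prompt),
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,  # cap output tokens, which cost 4x input on GPT-4o
    )
    return response.choices[0].message.content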
Error handling (2025): HTTP codes: 400 (invalid request), 401 (auth failed), 429 (rate limit), 500 (server error), 503 (overloaded). Python retry with tenacity: @retry(wait=wait_exponential(multiplier=1, min=2, max=60), stop=stop_after_attempt(5), retry=retry_if_exception_type((RateLimitError, APIConnectionError, InternalServerError))). Avoid retrying on the broad APIError base class, which also matches 400/401 errors that will never succeed. Exponential backoff formula: wait = min(60, 2**retry_count + random.uniform(0,1)). Retry on: 429 (rate limit), 500/502/503 (server errors), timeouts. Don't retry: 400 (bad request), 401 (auth), 404 (not found). Parse error: response.json()['error']['message']. Monitor error rates, log for debugging. Use try/except: try: response = client.chat.completions.create(...) except RateLimitError: handle_rate_limit(). Production essential for reliability.
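The tenacity pattern as a complete snippet, narrowed to transient errors:

from openai import OpenAI, APIConnectionError, InternalServerError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, APIConnectionError, InternalServerError)),
)
def complete_with_retry(**kwargs):
    # Retries only transient failures; 400/401/404 raise other exception types
    return client.chat.completions.create(**kwargs)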
Latest features (2025): (1) GPT-4o flagship: multimodal, $2.50/10 per 1M tokens, 128k context, 16k max output. (2) o1/o1-mini reasoning models: 83% AIME math, advanced problem-solving. (3) Structured outputs: 100% schema adherence with strict=True. (4) Prompt caching: automatic 50% discount on 1024+ token prefixes, 80% latency reduction. (5) Batch API: 50% cost discount for async tasks. (6) Assistants API v2: file_search (10k files), code_interpreter, persistent threads. (7) Vision: GPT-4o analyzes images, OCR, charts. (8) Whisper v3: 12.4% error rate, 102 languages. (9) text-embedding-3: $0.02/1M (small), better quality. (10) Parallel function calling, tools parameter. Stay updated: platform.openai.com/docs/changelog.