Developer Guide
Complete guide to integrating Promptly into your application. Promptly is an OpenAI-compatible proxy that optimizes every LLM request - reducing token usage, lowering costs, and improving response times - with nothing more than a base URL and API key change.
Quick Start
Get up and running in 3 minutes.
Step 1: Create an Account
curl -X POST https://api.getpromptly.in/api/auth/register \
-H "Content-Type: application/json" \
-d '{
"org_name": "My Company",
"email": "dev@mycompany.com",
"password": "securepassword",
"account_type": "team"
}'
Response:
{
"token": "eyJhbGciOiJIUz...",
"user": { "id": "uuid", "email": "dev@mycompany.com", "role": "admin" },
"org": { "id": "uuid", "name": "My Company", "account_type": "team" },
"api_key": "sk-promptly-abc123..."
}
Save your api_key - it's only shown once. This is your Promptly key.
Step 2: Connect Your Provider Key
Go to the dashboard → Keys → Add your OpenAI / Anthropic / Google API key. Or via API:
curl -X POST https://api.getpromptly.in/api/keys/providers \
-H "Authorization: Bearer <your-jwt-token>" \
-H "Content-Type: application/json" \
-d '{
"provider": "openai",
"api_key": "sk-your-openai-key"
}'
Promptly encrypts and stores your provider key securely.
Step 3: Use Promptly in Your App
import openai
client = openai.OpenAI(
api_key="sk-promptly-abc123...", # Your Promptly key
base_url="https://api.getpromptly.in/v1" # Point to Promptly
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is quantum computing?"}
]
)
print(response.choices[0].message.content)
How It Works
Your App                       Promptly                         LLM Provider
   │                              │                                  │
   │──── POST /v1/chat/... ──────▶│                                  │
   │     (Promptly API key)       │                                  │
   │                              │── 1. Validate API key            │
   │                              │── 2. Check semantic cache        │
   │                              │      (hit? return instant)       │
   │                              │── 3. Optimize prompt             │
   │                              │      • Whitespace removal        │
   │                              │      • Redundancy dedup          │
   │                              │      • System compression        │
   │                              │      • Context pruning           │
   │                              │── 4. Classify complexity         │
   │                              │      → Pick cheapest model       │
   │                              │── 5. Forward to provider ───────▶│
   │                              │      (your OpenAI key)           │──► LLM processes
   │                              │◀── 6. Receive response ──────────│
   │                              │── 7. Cache response              │
   │                              │── 8. Log everything              │
   │                              │── 9. Evaluate alerts             │
   │◀── Response + metadata ──────│                                  │
Authentication
Promptly uses two separate authentication systems:
Proxy Requests (your app → Promptly)
Use your Promptly API key as a Bearer token:
Authorization: Bearer sk-promptly-abc123...
This key authenticates proxy requests to /v1/* endpoints. It identifies your organization and loads your provider keys + optimization config.
Dashboard API (managing settings)
Use the JWT token returned at login/register:
Authorization: Bearer eyJhbGciOiJIUz...
This authenticates dashboard API requests (/api/* endpoints) - managing keys, viewing analytics, configuring optimization, etc.
OAuth Login
Promptly also supports OAuth via GitHub and Google. These are redirect-based flows:
# GitHub OAuth - redirects to GitHub authorization page
GET /api/auth/oauth/github
# GitHub callback - exchanges code for JWT
GET /api/auth/oauth/github/callback?code=xxx
# Google OAuth - redirects to Google authorization page
GET /api/auth/oauth/google
# Google callback - exchanges code for JWT
GET /api/auth/oauth/google/callback?code=xxx
After successful OAuth, you are redirected to the frontend with your JWT token.
Proxy Endpoint
POST /v1/chat/completions
The core endpoint. Fully compatible with the OpenAI Chat Completions API.
{
"model": "gpt-4o",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Explain recursion in 2 sentences." }
],
"temperature": 0.7,
"max_tokens": 500,
"stream": false
}
Supported Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | Requested model (may be overridden by routing) |
| messages | array | Array of message objects (role + content) |
| temperature | float | Sampling temperature (0-2) |
| max_tokens | int | Maximum tokens in response |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Frequency penalty (-2 to 2) |
| presence_penalty | float | Presence penalty (-2 to 2) |
| stop | string/array | Stop sequences |
| stream | bool | Enable SSE streaming |
| n | int | Number of completions |
| user | string | End-user identifier |
Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1709312000,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Recursion is when a function calls itself..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 28,
"completion_tokens": 45,
"total_tokens": 73
},
"promptly_metadata": {
"baseline_model": "gpt-4o",
"routed_model": "gpt-4o-mini",
"original_tokens": 52,
"optimized_tokens": 28,
"savings": 0.00034,
"cache_hit": false,
"latency_ms": 834
}
}
Notice promptly_metadata in the response - it tells you what Promptly did: the request asked for gpt-4o, but routing classified it as simple and used gpt-4o-mini. Prompt optimization reduced the prompt from 52 to 28 tokens (a 46% saving). Total cost savings: $0.00034.
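If you want to track savings client-side, the percentage above can be recomputed from promptly_metadata. A minimal sketch - the savings_percent helper is ours, not part of any SDK:

```python
def savings_percent(meta: dict) -> float:
    """Percent token reduction reported in promptly_metadata."""
    original = meta["original_tokens"]
    optimized = meta["optimized_tokens"]
    return round(100 * (original - optimized) / original, 1)

# Using the values from the sample response above:
meta = {"original_tokens": 52, "optimized_tokens": 28}
print(savings_percent(meta))  # → 46.2
```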
SDK Integration
Use our official SDK for the fastest setup, or change base_url in any OpenAI-compatible SDK.
Promptly Python SDK
pip install promptly-sdk
from promptly import Promptly
client = Promptly(api_key="sk-promptly-abc123...")
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Promptly Node.js SDK
npm install promptly-sdk
import Promptly from "promptly-sdk";
const client = new Promptly({ apiKey: "sk-promptly-abc123..." });
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
Wrap an Existing Client
Already have an OpenAI client? Wrap it to route through Promptly without changing any other code:
from promptly import wrap
from openai import OpenAI
client = wrap(OpenAI(api_key="sk-promptly-abc123..."))
# All existing code works unchanged
Alternative: Base URL method
You can also use any OpenAI-compatible SDK by changing base_url and api_key.
Python (openai SDK)
from openai import OpenAI
client = OpenAI(
api_key="sk-promptly-abc123...",
base_url="https://api.getpromptly.in/v1",
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
Node.js (openai SDK)
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "sk-promptly-abc123...",
baseURL: "https://api.getpromptly.in/v1",
});
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello!" }],
});
cURL
curl https://api.getpromptly.in/v1/chat/completions \
-H "Authorization: Bearer sk-promptly-abc123..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
LangChain (Python)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="gpt-4o",
api_key="sk-promptly-abc123...",
base_url="https://api.getpromptly.in/v1",
)
response = llm.invoke("Explain quantum computing.")
LlamaIndex
from llama_index.llms.openai import OpenAI
llm = OpenAI(
model="gpt-4o",
api_key="sk-promptly-abc123...",
api_base="https://api.getpromptly.in/v1",
)
Vercel AI SDK (TypeScript)
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";
const promptly = createOpenAI({
  baseURL: "https://api.getpromptly.in/v1",
  apiKey: "sk-promptly-abc123...",
});
const result = await generateText({
  model: promptly("gpt-4o"),
  prompt: "Explain recursion.",
});
Streaming
Promptly fully supports Server-Sent Events (SSE) streaming - the same format as OpenAI.
Python
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a poem about code."}],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)Node.js
const stream = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Write a poem about code." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Raw SSE Format
data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Optimization Settings
Control how aggressively Promptly optimizes your prompts.
Get Current Settings
GET /api/optimization/settings
Update Settings
PUT /api/optimization/settings
{
"level": "moderate",
"whitespace_removal": true,
"redundancy_elimination": true,
"system_prompt_compression": true,
"context_pruning": true,
"cache_enabled": true,
"cache_similarity_threshold": 0.95,
"cache_ttl_seconds": 3600,
"routing_mode": "auto"
}
Optimization Levels
| Level | What It Does | Expected Savings |
|---|---|---|
| conservative | Whitespace only, 20-message context window | 10-20% |
| moderate (default) | All optimizations, 10-message window + system dedup | 30-50% |
| aggressive | All optimizations, 6-message window + summary injection | 50-70% |
| off | No optimization - raw passthrough | 0% |
Individual Feature Toggles
| Feature | Default | What It Does |
|---|---|---|
| whitespace_removal | true | Normalizes whitespace, removes extra newlines/spaces |
| redundancy_elimination | true | Deduplicates repeated phrases and instructions |
| system_prompt_compression | true | Compresses verbose system prompts while keeping intent |
| context_pruning | true | Trims old conversation turns |
| cache_enabled | true | Enables semantic caching |
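As an illustration of what whitespace_removal does, here is a rough client-side approximation. This is a sketch of the idea, not Promptly's actual implementation:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs, cap blank lines, trim edges."""
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs → single space
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines → one blank line
    return "\n".join(line.strip() for line in text.split("\n")).strip()

prompt = "You are   a helpful\t assistant.\n\n\n\nAnswer briefly.  "
print(normalize_whitespace(prompt))
```

Fewer characters generally means fewer tokens, which is where the savings come from.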
Smart Routing
When routing_mode is set to "auto", Promptly analyzes each request and picks the cheapest model capable of handling it.
How Classification Works
Promptly uses AI-powered classification to analyze every request on a complexity scale. It considers factors like prompt length, intent, and content type to determine the optimal model tier:
| Input Pattern | Classification |
|---|---|
| Short, straightforward prompts | → simple |
| Moderate length with clear intent | → medium |
| Long prompts, code analysis, multi-step reasoning | → complex |
Model Selection by Provider
| Complexity | OpenAI | Anthropic | Google |
|---|---|---|---|
| simple | gpt-4o-mini | claude-haiku | gemini-2.0-flash |
| medium | gpt-4o | claude-sonnet | gemini-2.0-flash |
| complex | gpt-4o | claude-opus | gemini-2.0-pro |
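The table above amounts to a simple lookup. If you want to predict routing on the client, a hypothetical mirror of the server's table (which may change over time) looks like:

```python
# Hypothetical client-side mirror of Promptly's routing table above.
MODEL_BY_TIER = {
    "openai":    {"simple": "gpt-4o-mini",      "medium": "gpt-4o",           "complex": "gpt-4o"},
    "anthropic": {"simple": "claude-haiku",     "medium": "claude-sonnet",    "complex": "claude-opus"},
    "google":    {"simple": "gemini-2.0-flash", "medium": "gemini-2.0-flash", "complex": "gemini-2.0-pro"},
}

def pick_model(provider: str, complexity: str) -> str:
    """Return the model Promptly would route to for a given tier."""
    return MODEL_BY_TIER[provider][complexity]

print(pick_model("openai", "simple"))  # → gpt-4o-mini
```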
Routing Modes
| Mode | Behavior |
|---|---|
| auto | Classifies complexity → picks cheapest capable model |
| manual | Always uses the model specified in your request |
Fallback
If the routed model fails, Promptly automatically retries with the originally requested model and the unoptimized prompt. The event appears in your request logs with fallback_used: true.
Semantic Caching
When semantic caching is enabled, Promptly uses embedding-based similarity matching to detect when a user asks something similar to a previous request.
How It Works
Promptly uses a two-layer caching system:
1. Exact match - identical prompts return instantly.
2. Semantic similarity - near-duplicate prompts are matched via embedding comparison; if similarity ≥ the threshold (default: 0.95), the cached response is returned instantly.
On a cache miss, the response is stored after the LLM call.
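The two layers can be sketched roughly as follows. The lookup function and the toy two-dimensional embeddings are illustrative only, not Promptly internals:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(cache, prompt, embedding, threshold=0.95):
    # Layer 1: exact match on the prompt string.
    if prompt in cache["exact"]:
        return cache["exact"][prompt]
    # Layer 2: nearest cached embedding, accepted only above the threshold.
    best = max(cache["semantic"], key=lambda e: cosine(e["vec"], embedding), default=None)
    if best and cosine(best["vec"], embedding) >= threshold:
        return best["response"]
    return None  # miss → call the LLM, then cache the result

cache = {
    "exact": {"What's the capital of France?": "Paris."},
    "semantic": [{"vec": [1.0, 0.0], "response": "Paris."}],
}
print(lookup(cache, "capital of france?", [0.99, 0.05]))  # high similarity → "Paris."
```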
Cache Hit Behavior
Cost: $0 - no LLM call is made.
Latency: ~2-50ms - exact matches are near-instant; semantic lookups take slightly longer.
Each hit is logged as a cache hit in your analytics.
Configuration
| Setting | Default | Description |
|---|---|---|
| cache_enabled | true | Enable/disable caching |
| cache_similarity_threshold | 0.95 | Minimum cosine similarity (0.0 - 1.0). Lower = more hits, less precision |
| cache_ttl_seconds | 3600 | How long cached responses live (seconds) |
Cache Management
# View cache statistics
GET /api/optimization/cache/stats
# Response: { "exact_entries": 812, "semantic_entries": 435, "total_entries": 1247 }
# Clear all cached responses
POST /api/optimization/cache/clear
Examples
Request 1: "What's the capital of France?" → Miss → calls LLM → caches
Request 2: "capital of france?" → Hit (0.97 similarity) → $0
Request 3: "What is France's capital city?" → Hit (0.96 similarity) → $0
Request 4: "What's the capital of Germany?" → Miss (0.72 similarity) → calls LLMContext Pruning
Long conversations waste tokens on stale context. Context pruning trims old turns automatically based on your optimization level.
Behavior by Level
| Level | Window Size | System Dedup | Summary Injection |
|---|---|---|---|
| conservative | Last 20 messages | No | No |
| moderate | Last 10 messages | Yes - removes duplicate system messages | No |
| aggressive | Last 6 messages | Yes | Yes - injects a one-line summary of dropped messages |
Example
# 48-message conversation with aggressive pruning:
Before: 4,200 tokens (full history)
After: 980 tokens (6 latest messages + summary of dropped ones)
# Summary injected as system message:
"[Earlier context pruned: 8 user and 7 assistant messages were removed to save tokens.]"Context pruning runs only on the messages array. The most recent messages are always preserved. System messages are kept (but deduplicated on moderate/aggressive).
Custom Routing Rules
Override auto-routing with custom rules that are checked before classification.
Create a Rule
POST /api/routing/rules
{
"name": "Code tasks use GPT-4o",
"condition_type": "keyword",
"condition_value": { "keywords": ["code", "debug", "refactor"] },
"action": "use_model",
"action_value": "gpt-4o",
"priority": 10
}'
Condition Types
| Type | Value Format | Description |
|---|---|---|
| keyword | {"keywords": ["word1", "word2"]} | Matches if any keyword appears in the messages |
| token_count | {"operator": "gt", "value": 2000} | Matches on approximate token count (gt or lt) |
| always | {} | Matches every request (useful as a catch-all) |
Action Types
| Action | Value | Description |
|---|---|---|
| use_model | "gpt-4o-mini" | Route directly to a specific model |
| force_tier | "simple" | Force a complexity tier (simple/medium/complex) |
Examples
{
"name": "Short prompts → mini",
"condition_type": "token_count",
"condition_value": { "operator": "lt", "value": 50 },
"action": "use_model",
"action_value": "gpt-4o-mini",
"priority": 20
}
{
"name": "Default everything to Sonnet",
"condition_type": "always",
"condition_value": {},
"action": "use_model",
"action_value": "claude-sonnet",
"priority": 0
}
List / Delete Rules
# List all rules
GET /api/routing/rules
# Delete a rule
DELETE /api/routing/rules/{rule_id}
Alerts
Set up alerts when metrics cross thresholds.
Create an Alert Rule
POST /api/alerts/rules
{
"name": "High daily spend",
"metric": "daily_spend",
"operator": "gt",
"threshold": 50.00,
"channels": ["email"]
}
Available Metrics
| Metric | Description |
|---|---|
| daily_spend | Total cost for the day ($) |
| error_rate | Percentage of failed requests |
| latency_p95 | 95th percentile latency (ms) |
| savings_percentage | Optimization savings rate (%) |
Operators
| Operator | Description |
|---|---|
| gt | Greater than |
| lt | Less than |
| gte | Greater than or equal |
| lte | Less than or equal |
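The evaluation semantics are straightforward - a sketch of how an alert rule maps to a comparison (alert_triggered is illustrative, not the server's code):

```python
import operator as op

OPS = {"gt": op.gt, "lt": op.lt, "gte": op.ge, "lte": op.le}

def alert_triggered(rule, current_value):
    """True when the current metric value crosses the rule's threshold."""
    return OPS[rule["operator"]](current_value, rule["threshold"])

rule = {"metric": "daily_spend", "operator": "gt", "threshold": 50.00}
print(alert_triggered(rule, 62.40))  # → True
```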
Manage Alerts
# List all alerts
GET /api/alerts/rules
# Update an alert
PUT /api/alerts/rules/{rule_id}
{ "threshold": 100.00, "is_active": true }
# Delete an alert
DELETE /api/alerts/rules/{rule_id}
Alerts are evaluated automatically in the background on a recurring schedule.
Analytics & Request Logs
Dashboard Overview
GET /api/analytics/overview
Returns total requests, costs, savings, cache hit rate, and other aggregate stats.
Savings Over Time
GET /api/analytics/savings
Model Distribution
GET /api/analytics/models
Savings by Feature
GET /api/analytics/levers
Breaks down savings by optimization feature (whitespace, redundancy, compression, caching, routing).
Request Logs
Every proxy request is logged with full detail:
# List requests (paginated)
GET /api/requests
# Get a specific request
GET /api/requests/{request_id}
Each log entry includes: baseline model vs. routed model, original tokens vs. optimized tokens, cost breakdown and savings, cache hit status, optimizations applied, latency (total, proxy overhead, provider time), and success / fallback / error status.
Team Management
Team accounts (account_type: "team") can manage multiple users.
Invite a Member
POST /api/team/invite
{
"email": "teammate@company.com",
"role": "viewer"
}
Roles
| Role | Permissions |
|---|---|
| admin | Full access - manage keys, settings, team, billing |
| member | Standard access - manage own key, view analytics, make requests |
| viewer | Read-only - view dashboard, analytics, request logs |
Manage Members
# List team members
GET /api/team/members
# Update a member's role
PUT /api/team/members/{user_id}
{ "role": "admin" }
# Remove a member
DELETE /api/team/members/{user_id}
All team members operate under the same organization - sharing API keys, analytics, and settings.
API Reference
Base URLs
| Environment | URL |
|---|---|
| Production | https://api.getpromptly.in |
| Local dev | http://localhost:8000 |
Proxy
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /v1/chat/completions | API Key | Chat completions (OpenAI-compatible) |
| POST | /v1/completions | API Key | Legacy endpoint (redirects to chat) |
Auth
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /api/auth/register | None | Register org → returns JWT + API key |
| POST | /api/auth/login | None | Login → returns JWT |
| GET | /api/auth/oauth/github | None | GitHub OAuth redirect |
| GET | /api/auth/oauth/google | None | Google OAuth redirect |
Keys
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/keys/platform | JWT | List Promptly API keys |
| POST | /api/keys/platform/regenerate | JWT | Regenerate API key |
| GET | /api/keys/providers | JWT | List provider keys |
| POST | /api/keys/providers | JWT | Add provider key |
| PUT | /api/keys/providers/{id} | JWT | Update provider key |
| DELETE | /api/keys/providers/{id} | JWT | Delete provider key |
Optimization
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/optimization/settings | JWT | Get optimization config |
| PUT | /api/optimization/settings | JWT | Update optimization config |
| GET | /api/optimization/cache/stats | JWT | Cache statistics |
| POST | /api/optimization/cache/clear | JWT | Clear cache |
Routing
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/routing/config | JWT | Get routing config |
| PUT | /api/routing/config | JWT | Update routing mode |
| GET | /api/routing/rules | JWT | List custom rules |
| POST | /api/routing/rules | JWT | Create rule |
| DELETE | /api/routing/rules/{id} | JWT | Delete rule |
Analytics
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/analytics/overview | JWT | Dashboard overview |
| GET | /api/analytics/savings | JWT | Savings time series |
| GET | /api/analytics/models | JWT | Model usage breakdown |
| GET | /api/analytics/levers | JWT | Savings by feature |
Requests
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/requests | JWT | List request logs |
| GET | /api/requests/{id} | JWT | Get request detail |
Team
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/team/members | JWT | List members |
| POST | /api/team/invite | JWT | Invite member |
| PUT | /api/team/members/{id} | JWT | Update role |
| DELETE | /api/team/members/{id} | JWT | Remove member |
Alerts
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/alerts/rules | JWT | List alert rules |
| POST | /api/alerts/rules | JWT | Create alert |
| PUT | /api/alerts/rules/{id} | JWT | Update alert |
| DELETE | /api/alerts/rules/{id} | JWT | Delete alert |
Organization
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/org | JWT | Get org details |
| PUT | /api/org | JWT | Update org |
Error Handling
Promptly returns standard HTTP error codes with structured error bodies:
{
"detail": "No valid provider key found. Please connect an API key in the dashboard."
}
Common Errors
| Code | Cause |
|---|---|
| 401 Unauthorized | Missing, invalid, or inactive API key |
| 400 Bad Request | No provider key configured for the org |
| 409 Conflict | Email already registered |
| 429 Too Many Requests | Rate limit exceeded |
| 502 Bad Gateway | Upstream provider (OpenAI, etc.) returned an error |
Provider Failures + Fallback
If the routed model fails, Promptly automatically: (1) Falls back to the originally requested model, (2) Uses the original unoptimized prompt, (3) Retries once, and (4) Logs the event with status: "fallback".
Rate Limits
Promptly enforces per-key rate limiting:
| Limit | Default |
|---|---|
| Requests per minute | 500 |
When exceeded, you'll receive a 429 Too Many Requests response. Rate limits apply per Promptly API key (i.e., per organization).
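When you do hit the limit, back off and retry rather than hammering the endpoint. A minimal client-side sketch - the RateLimited exception is a placeholder for whatever your HTTP layer raises on a 429:

```python
import random
import time

class RateLimited(Exception):
    """Placeholder: raised by your HTTP layer on a 429 response."""

def with_backoff(call, max_retries=5, sleep=time.sleep):
    """Retry `call` with exponential backoff + jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            sleep(min(2 ** attempt + random.random(), 30.0))
```

The sleep parameter exists mainly so tests can stub out real waiting; in production, leave it at the default.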
FAQ
Does Promptly store my prompts or responses?
Promptly logs request metadata (model, tokens, cost, latency) for analytics. Cached responses are stored with a configurable TTL and can be cleared at any time. Provider API keys are encrypted at rest.
Can I use my Anthropic or Google key instead of OpenAI?
Yes. Promptly supports OpenAI, Anthropic, and Google. You connect whichever provider key you want in the dashboard. Your app always talks to Promptly in OpenAI format - Promptly translates to the correct provider format behind the scenes.
Does optimization change the quality of responses?
No. Optimization removes redundant tokens (extra whitespace, repeated instructions, verbose phrasing) without changing the semantic meaning. The LLM receives the same intent with fewer tokens.
What happens if I send a request without connecting a provider key?
You'll get a 400 Bad Request:
{ "detail": "No valid provider key found. Please connect an API key in the dashboard." }Can I disable all optimization and use Promptly as a pure proxy?
Yes. Set level to "off" and routing_mode to "manual":
PUT /api/optimization/settings
{ "level": "off", "routing_mode": "manual", "cache_enabled": false }How is my provider API key stored?
Provider keys are encrypted using industry-standard encryption. The raw key is never stored - only encrypted references.
What's the difference between Personal and Team accounts?
Both get the same proxy, optimization, and routing features. Team accounts add multi-user support with role-based access (admin/member/viewer), an invite flow, and shared analytics across team members.