Developer Guide

Complete guide to integrating Promptly into your application. Promptly is an OpenAI-compatible proxy that optimizes every LLM request - reducing token usage, lowering costs, and improving response times with zero code changes.

Quick Start

Get up and running in 3 minutes.

Step 1: Create an Account

Register
curl -X POST https://api.getpromptly.in/api/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "org_name": "My Company",
    "email": "dev@mycompany.com",
    "password": "securepassword",
    "account_type": "team"
  }'

Response:

json
{
  "token": "eyJhbGciOiJIUz...",
  "user": { "id": "uuid", "email": "dev@mycompany.com", "role": "admin" },
  "org": { "id": "uuid", "name": "My Company", "account_type": "team" },
  "api_key": "sk-promptly-abc123..."
}
Important: Save the api_key - it's only shown once. This is your Promptly key.

Step 2: Connect Your Provider Key

Go to the dashboard → Keys → Add your OpenAI / Anthropic / Google API key. Or via API:

Add provider key
curl -X POST https://api.getpromptly.in/api/keys/providers \
  -H "Authorization: Bearer <your-jwt-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "api_key": "sk-your-openai-key"
  }'

Promptly encrypts and stores your provider key securely.

Step 3: Use Promptly in Your App

Python
import openai

client = openai.OpenAI(
    api_key="sk-promptly-abc123...",       # Your Promptly key
    base_url="https://api.getpromptly.in/v1"  # Point to Promptly
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantum computing?"}
    ]
)

print(response.choices[0].message.content)
That's it. Every request is now automatically optimized, cached, and logged.

How It Works

Request Flow
Your App                      Promptly                      LLM Provider
   │                             │                              │
   │──── POST /v1/chat/... ─────▶│                              │
   │     (Promptly API key)      │                              │
   │                             │── 1. Validate API key        │
   │                             │── 2. Check semantic cache    │
   │                             │      (hit? return instant)   │
   │                             │── 3. Optimize prompt         │
   │                             │      • Whitespace removal    │
   │                             │      • Redundancy dedup      │
   │                             │      • System compression    │
   │                             │      • Context pruning       │
   │                             │── 4. Classify complexity     │
   │                             │      → Pick cheapest model   │
   │                             │── 5. Forward to provider ───▶│
   │                             │      (your OpenAI key)       │──► LLM processes
   │                             │◀── 6. Receive response ──────│
   │                             │── 7. Cache response          │
   │                             │── 8. Log everything          │
   │                             │── 9. Evaluate alerts         │
   │◀── Response + metadata ─────│                              │
Key insight: Your provider key gets charged for the actual LLM call. Promptly never touches your LLM billing - it just reduces how many tokens get sent and picks cheaper models when possible.

Authentication

Promptly uses two separate authentication systems:

Proxy Requests (your app → Promptly)

Use your Promptly API key as a Bearer token:

Authorization: Bearer sk-promptly-abc123...

This key authenticates proxy requests to /v1/* endpoints. It identifies your organization and loads your provider keys + optimization config.

Dashboard API (managing settings)

Use the JWT token returned at login/register:

Authorization: Bearer eyJhbGciOiJIUz...

This authenticates dashboard API requests (/api/* endpoints) - managing keys, viewing analytics, configuring optimization, etc.
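As a sketch, a client can keep the two credentials from getting mixed up with a small helper. This is illustrative only (the key values are placeholders, and `auth_headers` is a hypothetical name, not part of any Promptly SDK):

```python
# Sketch: build the right Authorization header for each API surface.
# Proxy calls (/v1/*) use the Promptly API key; dashboard calls (/api/*)
# use the JWT returned by login/register.

def auth_headers(kind: str, promptly_key: str, jwt_token: str) -> dict:
    """Return headers for 'proxy' (/v1/*) or 'dashboard' (/api/*) requests."""
    if kind == "proxy":
        token = promptly_key      # Promptly API key
    elif kind == "dashboard":
        token = jwt_token         # JWT from login/register
    else:
        raise ValueError(f"unknown request kind: {kind!r}")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

print(auth_headers("proxy", "sk-promptly-abc123", "eyJhbGci...")["Authorization"])
```

Sending the JWT to `/v1/*` (or the API key to `/api/*`) will fail with a 401, so keeping the two paths explicit avoids a common integration mistake.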

OAuth Login

Promptly also supports OAuth via GitHub and Google. These are redirect-based flows:

bash
# GitHub OAuth - redirects to GitHub authorization page
GET /api/auth/oauth/github
# GitHub callback - exchanges code for JWT
GET /api/auth/oauth/github/callback?code=xxx

# Google OAuth - redirects to Google authorization page
GET /api/auth/oauth/google
# Google callback - exchanges code for JWT
GET /api/auth/oauth/google/callback?code=xxx

After successful OAuth, you are redirected to the frontend with your JWT token.

Proxy Endpoint

POST /v1/chat/completions

The core endpoint. Fully compatible with the OpenAI Chat Completions API.

Request
{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Explain recursion in 2 sentences." }
  ],
  "temperature": 0.7,
  "max_tokens": 500,
  "stream": false
}

Supported Parameters

| Parameter | Type | Description |
|---|---|---|
| model | string | Requested model (may be overridden by routing) |
| messages | array | Array of message objects (role + content) |
| temperature | float | Sampling temperature (0-2) |
| max_tokens | int | Maximum tokens in response |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Frequency penalty (-2 to 2) |
| presence_penalty | float | Presence penalty (-2 to 2) |
| stop | string/array | Stop sequences |
| stream | bool | Enable SSE streaming |
| n | int | Number of completions |
| user | string | End-user identifier |

Response

json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1709312000,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Recursion is when a function calls itself..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 45,
    "total_tokens": 73
  },
  "promptly_metadata": {
    "baseline_model": "gpt-4o",
    "routed_model": "gpt-4o-mini",
    "original_tokens": 52,
    "optimized_tokens": 28,
    "savings": 0.00034,
    "cache_hit": false,
    "latency_ms": 834
  }
}

Notice promptly_metadata in the response - this tells you what Promptly did: the request asked for gpt-4o but routing classified it as simple → used gpt-4o-mini. Prompt optimization reduced 52 → 28 tokens (46% saving). Total cost savings: $0.00034.
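If you call the endpoint over raw HTTP, the metadata is easy to surface in your own logging. A minimal sketch (the `summarize_metadata` helper is illustrative, not part of any SDK):

```python
# Sketch: inspect Promptly's optimization metadata on a parsed response body.

def summarize_metadata(response: dict) -> str:
    meta = response.get("promptly_metadata")
    if meta is None:
        return "no optimization metadata"
    saved = meta["original_tokens"] - meta["optimized_tokens"]
    pct = 100 * saved / meta["original_tokens"]
    return (f"{meta['baseline_model']} -> {meta['routed_model']}, "
            f"{saved} tokens saved ({pct:.0f}%), cache_hit={meta['cache_hit']}")

response = {
    "promptly_metadata": {
        "baseline_model": "gpt-4o", "routed_model": "gpt-4o-mini",
        "original_tokens": 52, "optimized_tokens": 28,
        "savings": 0.00034, "cache_hit": False,
    }
}
print(summarize_metadata(response))
# gpt-4o -> gpt-4o-mini, 24 tokens saved (46%), cache_hit=False
```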

SDK Integration

Use our official SDK for the fastest setup, or change base_url in any OpenAI-compatible SDK.

Promptly Python SDK

bash
pip install promptly-sdk
python
from promptly import Promptly

client = Promptly(api_key="sk-promptly-abc123...")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Promptly Node.js SDK

bash
npm install promptly-sdk
typescript
import Promptly from "promptly-sdk";

const client = new Promptly({ apiKey: "sk-promptly-abc123..." });

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

Wrap an Existing Client

Already have an OpenAI client? Wrap it to route through Promptly without changing any other code:

python
from promptly import wrap
from openai import OpenAI

client = wrap(OpenAI(api_key="sk-promptly-abc123..."))
# All existing code works unchanged

Alternative: Base URL method

You can also use any OpenAI-compatible SDK by changing base_url and api_key.

Python (openai SDK)

python
from openai import OpenAI

client = OpenAI(
    api_key="sk-promptly-abc123...",
    base_url="https://api.getpromptly.in/v1",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)

Node.js (openai SDK)

javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-promptly-abc123...",
  baseURL: "https://api.getpromptly.in/v1",
});

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello!" }],
});

cURL

bash
curl https://api.getpromptly.in/v1/chat/completions \
  -H "Authorization: Bearer sk-promptly-abc123..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

LangChain (Python)

python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    api_key="sk-promptly-abc123...",
    base_url="https://api.getpromptly.in/v1",
)

response = llm.invoke("Explain quantum computing.")

LlamaIndex

python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o",
    api_key="sk-promptly-abc123...",
    api_base="https://api.getpromptly.in/v1",
)

Vercel AI SDK (TypeScript)

typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const promptly = createOpenAI({
  baseURL: "https://api.getpromptly.in/v1",
  apiKey: "sk-promptly-abc123...",
});

const result = await generateText({
  model: promptly("gpt-4o"),
  prompt: "Explain recursion.",
});

Streaming

Promptly fully supports Server-Sent Events (SSE) streaming - the same format as OpenAI.

Python

python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about code."}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

Node.js

javascript
const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Write a poem about code." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Raw SSE Format

data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
Streaming requests still get prompt optimization and routing. Only semantic caching is skipped for streams (responses aren't cached from streaming calls).
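If you consume the raw SSE stream yourself (e.g. without an SDK), a minimal parser for the format shown above might look like this sketch. It only handles `data:` lines and the `[DONE]` sentinel:

```python
import json

def iter_sse_content(lines):
    """Yield content deltas from raw SSE lines in the format shown above."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return                        # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:            # final chunk has an empty delta
            yield delta["content"]

raw = [
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    '',
    'data: [DONE]',
]
print("".join(iter_sse_content(raw)))  # Hello world
```

In production, prefer an SSE client library that handles reconnection and multi-line events; this sketch assumes each event fits on one `data:` line.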

Optimization Settings

Control how aggressively Promptly optimizes your prompts.

Get Current Settings

bash
GET /api/optimization/settings

Update Settings

PUT /api/optimization/settings
{
  "level": "moderate",
  "whitespace_removal": true,
  "redundancy_elimination": true,
  "system_prompt_compression": true,
  "context_pruning": true,
  "cache_enabled": true,
  "cache_similarity_threshold": 0.95,
  "cache_ttl_seconds": 3600,
  "routing_mode": "auto"
}

Optimization Levels

| Level | What It Does | Expected Savings |
|---|---|---|
| conservative | Whitespace only, 20-message context window | 10-20% |
| moderate (default) | All optimizations, 10-message window + system dedup | 30-50% |
| aggressive | All optimizations, 6-message window + summary injection | 50-70% |
| off | No optimization - raw passthrough | 0% |

Individual Feature Toggles

| Feature | Default | What It Does |
|---|---|---|
| whitespace_removal | true | Normalizes whitespace, removes extra newlines/spaces |
| redundancy_elimination | true | Deduplicates repeated phrases and instructions |
| system_prompt_compression | true | Compresses verbose system prompts while keeping intent |
| context_pruning | true | Trims old conversation turns |
| cache_enabled | true | Enables semantic caching |

Smart Routing

When routing_mode is set to "auto", Promptly analyzes each request and picks the cheapest model capable of handling it.

How Classification Works

Promptly uses AI-powered classification to analyze every request on a complexity scale. It considers factors like prompt length, intent, and content type to determine the optimal model tier:

| Input Pattern | Classification |
|---|---|
| Short, straightforward prompts | simple |
| Moderate length with clear intent | medium |
| Long prompts, code analysis, multi-step reasoning | complex |

Model Selection by Provider

| Complexity | OpenAI | Anthropic | Google |
|---|---|---|---|
| simple | gpt-4o-mini | claude-haiku | gemini-2.0-flash |
| medium | gpt-4o | claude-sonnet | gemini-2.0-flash |
| complex | gpt-4o | claude-opus | gemini-2.0-pro |
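The selection table can be read as a plain lookup. As a sketch (not Promptly's actual implementation), routing reduces to:

```python
# Sketch of the tier-to-model mapping from the table above.
MODEL_BY_TIER = {
    "openai":    {"simple": "gpt-4o-mini",      "medium": "gpt-4o",           "complex": "gpt-4o"},
    "anthropic": {"simple": "claude-haiku",     "medium": "claude-sonnet",    "complex": "claude-opus"},
    "google":    {"simple": "gemini-2.0-flash", "medium": "gemini-2.0-flash", "complex": "gemini-2.0-pro"},
}

def route(provider: str, complexity: str) -> str:
    """Pick the cheapest capable model for a classified request."""
    return MODEL_BY_TIER[provider][complexity]

print(route("openai", "simple"))  # gpt-4o-mini
```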

Routing Modes

| Mode | Behavior |
|---|---|
| auto | Classifies complexity → picks cheapest capable model |
| manual | Always uses the model specified in your request |

Fallback

If the routed model fails, Promptly automatically retries with the original model and unoptimized prompt. The response will include fallback_used: true in the logs.

Semantic Caching

When semantic caching is enabled, Promptly uses embedding-based similarity matching to detect when a user asks something similar to a previous request.

How It Works

Promptly uses a two-layer caching system:
1. Exact match - identical prompts are matched instantly.
2. Semantic similarity - similar prompts are matched by embedding comparison; if cosine similarity ≥ the threshold (default: 0.95), the cached response is returned instantly.

On a miss, the response is cached after the LLM call.
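The semantic-similarity lookup can be illustrated with a toy sketch. The three-dimensional vectors below stand in for real learned embeddings, and `cache_lookup` is a hypothetical helper, not Promptly's implementation:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cache_lookup(query_vec, cache, threshold=0.95):
    """Return the most similar cached response, if similarity clears the threshold."""
    best = max(cache, key=lambda e: cosine_similarity(query_vec, e["vec"]), default=None)
    if best and cosine_similarity(query_vec, best["vec"]) >= threshold:
        return best["response"]
    return None  # cache miss -> call the LLM, then store the result

cache = [{"vec": [0.9, 0.1, 0.4], "response": "Paris"}]
print(cache_lookup([0.88, 0.12, 0.41], cache))  # similar phrasing -> hit ("Paris")
print(cache_lookup([0.1, 0.9, 0.0], cache))     # unrelated question -> None
```

Lowering `threshold` trades precision for more hits, which is exactly the `cache_similarity_threshold` knob described below.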

Cache Hit Behavior

- Cost: $0 - no LLM call is made.
- Latency: ~2-50ms - exact matches are near-instant; semantic lookups take slightly longer.
- Logged as a cache hit in your analytics.

Configuration

| Setting | Default | Description |
|---|---|---|
| cache_enabled | true | Enable/disable caching |
| cache_similarity_threshold | 0.95 | Minimum cosine similarity (0.0-1.0). Lower = more hits, less precision |
| cache_ttl_seconds | 3600 | How long cached responses live (seconds) |

Cache Management

Cache endpoints
# View cache statistics
GET /api/optimization/cache/stats
# Response: { "exact_entries": 812, "semantic_entries": 435, "total_entries": 1247 }

# Clear all cached responses
POST /api/optimization/cache/clear

Examples

Request 1: "What's the capital of France?"        → Miss → calls LLM → caches
Request 2: "capital of france?"                    → Hit (0.97 similarity) → $0
Request 3: "What is France's capital city?"        → Hit (0.96 similarity) → $0
Request 4: "What's the capital of Germany?"        → Miss (0.72 similarity) → calls LLM

Context Pruning

Long conversations waste tokens on stale context. Context pruning trims old turns automatically based on your optimization level.

Behavior by Level

| Level | Window Size | System Dedup | Summary Injection |
|---|---|---|---|
| conservative | Last 20 messages | No | No |
| moderate | Last 10 messages | Yes - removes duplicate system messages | No |
| aggressive | Last 6 messages | Yes | Yes - injects a one-line summary of dropped messages |

Example

# 48-message conversation with aggressive pruning:

Before: 4,200 tokens (full history)
After:    980 tokens (6 latest messages + summary of dropped ones)

# Summary injected as system message:
"[Earlier context pruned: 42 older messages were removed to save tokens.]"

Context pruning runs only on the messages array. The most recent messages are always preserved. System messages are kept (but deduplicated on moderate/aggressive).
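The behavior described above can be sketched in a few lines. This is a simplified illustration under assumed semantics (keep system messages, keep the last N other messages, optionally inject a summary), not Promptly's actual code:

```python
# Sketch of window-based context pruning with optional summary injection.

def prune_context(messages, window=6, inject_summary=True):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, dropped = rest[-window:], rest[:-window]
    pruned = system + kept
    if inject_summary and dropped:
        users = sum(1 for m in dropped if m["role"] == "user")
        assistants = len(dropped) - users
        summary = (f"[Earlier context pruned: {users} user and {assistants} "
                   f"assistant messages were removed to save tokens.]")
        pruned = system + [{"role": "system", "content": summary}] + kept
    return pruned

history = [{"role": "system", "content": "Be helpful."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(20)
]
pruned = prune_context(history, window=6)
print(len(pruned))  # 1 system + 1 summary + 6 recent = 8
```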

Custom Routing Rules

Override auto-routing with custom rules that are checked before classification.

Create a Rule

POST /api/routing/rules
{
  "name": "Code tasks use GPT-4o",
  "condition_type": "keyword",
  "condition_value": { "keywords": ["code", "debug", "refactor"] },
  "action": "use_model",
  "action_value": "gpt-4o",
  "priority": 10
}

Condition Types

| Type | Value Format | Description |
|---|---|---|
| keyword | `{"keywords": ["word1", "word2"]}` | Matches if any keyword appears in the messages |
| token_count | `{"operator": "gt", "value": 2000}` | Matches on approximate token count (gt or lt) |
| always | `{}` | Matches every request (useful as a catch-all) |

Action Types

| Action | Value | Description |
|---|---|---|
| use_model | `"gpt-4o-mini"` | Route directly to a specific model |
| force_tier | `"simple"` | Force a complexity tier (simple/medium/complex) |
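Putting conditions, actions, and priorities together, rule evaluation might be sketched as follows (hypothetical: it assumes higher priority is checked first and the first match wins, which is consistent with the examples below but not a specification):

```python
# Sketch of custom-rule matching, for illustration only.

def matches(rule, prompt_text, token_count):
    ctype, cval = rule["condition_type"], rule["condition_value"]
    if ctype == "keyword":
        return any(kw.lower() in prompt_text.lower() for kw in cval["keywords"])
    if ctype == "token_count":
        op, value = cval["operator"], cval["value"]
        return token_count > value if op == "gt" else token_count < value
    if ctype == "always":
        return True
    return False

def first_matching_rule(rules, prompt_text, token_count):
    """Check rules in descending priority; the first match wins."""
    for rule in sorted(rules, key=lambda r: -r["priority"]):
        if matches(rule, prompt_text, token_count):
            return rule
    return None

rules = [
    {"name": "code", "priority": 10, "condition_type": "keyword",
     "condition_value": {"keywords": ["code", "debug"]},
     "action": "use_model", "action_value": "gpt-4o"},
    {"name": "catch-all", "priority": 0, "condition_type": "always",
     "condition_value": {}, "action": "use_model", "action_value": "claude-sonnet"},
]
print(first_matching_rule(rules, "please debug this", 12)["action_value"])  # gpt-4o
print(first_matching_rule(rules, "hello", 3)["action_value"])  # claude-sonnet
```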

Examples

Short prompts → mini
{
  "name": "Short prompts → mini",
  "condition_type": "token_count",
  "condition_value": { "operator": "lt", "value": 50 },
  "action": "use_model",
  "action_value": "gpt-4o-mini",
  "priority": 20
}
Default everything to Sonnet
{
  "name": "Default everything to Sonnet",
  "condition_type": "always",
  "condition_value": {},
  "action": "use_model",
  "action_value": "claude-sonnet",
  "priority": 0
}

List / Delete Rules

bash
# List all rules
GET /api/routing/rules

# Delete a rule
DELETE /api/routing/rules/{rule_id}

Alerts

Set up alerts when metrics cross thresholds.

Create an Alert Rule

POST /api/alerts/rules
{
  "name": "High daily spend",
  "metric": "daily_spend",
  "operator": "gt",
  "threshold": 50.00,
  "channels": ["email"]
}

Available Metrics

| Metric | Description |
|---|---|
| daily_spend | Total cost for the day ($) |
| error_rate | Percentage of failed requests |
| latency_p95 | 95th percentile latency (ms) |
| savings_percentage | Optimization savings rate (%) |

Operators

| Operator | Description |
|---|---|
| gt | Greater than |
| lt | Less than |
| gte | Greater than or equal |
| lte | Less than or equal |
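An alert rule is just a metric, an operator, and a threshold, so evaluation reduces to a comparison. A minimal sketch (not Promptly's implementation):

```python
import operator

# Map the rule operators above to Python comparison functions.
OPS = {"gt": operator.gt, "lt": operator.lt, "gte": operator.ge, "lte": operator.le}

def alert_fires(rule, current_value):
    """True when the current metric value crosses the rule's threshold."""
    return OPS[rule["operator"]](current_value, rule["threshold"])

rule = {"metric": "daily_spend", "operator": "gt", "threshold": 50.00}
print(alert_fires(rule, 62.10))  # True
print(alert_fires(rule, 31.00))  # False
```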

Manage Alerts

bash
# List all alerts
GET /api/alerts/rules

# Update an alert
PUT /api/alerts/rules/{rule_id}
{ "threshold": 100.00, "is_active": true }

# Delete an alert
DELETE /api/alerts/rules/{rule_id}

Alerts are evaluated automatically in the background on a recurring schedule.

Analytics & Request Logs

Dashboard Overview

bash
GET /api/analytics/overview

Returns total requests, costs, savings, cache hit rate, and other aggregate stats.

Savings Over Time

bash
GET /api/analytics/savings

Model Distribution

bash
GET /api/analytics/models

Savings by Feature

bash
GET /api/analytics/levers

Breaks down savings by optimization feature (whitespace, redundancy, compression, caching, routing).

Request Logs

Every proxy request is logged with full detail:

bash
# List requests (paginated)
GET /api/requests

# Get a specific request
GET /api/requests/{request_id}

Each log entry includes: baseline model vs. routed model, original tokens vs. optimized tokens, cost breakdown and savings, cache hit status, optimizations applied, latency (total, proxy overhead, provider time), and success / fallback / error status.

Team Management

Team accounts (account_type: "team") can manage multiple users.

Invite a Member

POST /api/team/invite
{
  "email": "teammate@company.com",
  "role": "viewer"
}

Roles

| Role | Permissions |
|---|---|
| admin | Full access - manage keys, settings, team, billing |
| member | Standard access - manage own key, view analytics, make requests |
| viewer | Read-only - view dashboard, analytics, request logs |

Manage Members

bash
# List team members
GET /api/team/members

# Update a member's role
PUT /api/team/members/{user_id}
{ "role": "admin" }

# Remove a member
DELETE /api/team/members/{user_id}

All team members operate under the same organization - sharing API keys, analytics, and settings.

API Reference

Base URLs

| Environment | URL |
|---|---|
| Production | https://api.getpromptly.in |
| Local dev | http://localhost:8000 |

Proxy

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /v1/chat/completions | API Key | Chat completions (OpenAI-compatible) |
| POST | /v1/completions | API Key | Legacy endpoint (redirects to chat) |

Auth

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /api/auth/register | None | Register org → returns JWT + API key |
| POST | /api/auth/login | None | Login → returns JWT |
| GET | /api/auth/oauth/github | None | GitHub OAuth redirect |
| GET | /api/auth/oauth/google | None | Google OAuth redirect |

Keys

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/keys/platform | JWT | List Promptly API keys |
| POST | /api/keys/platform/regenerate | JWT | Regenerate API key |
| GET | /api/keys/providers | JWT | List provider keys |
| POST | /api/keys/providers | JWT | Add provider key |
| PUT | /api/keys/providers/{id} | JWT | Update provider key |
| DELETE | /api/keys/providers/{id} | JWT | Delete provider key |

Optimization

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/optimization/settings | JWT | Get optimization config |
| PUT | /api/optimization/settings | JWT | Update optimization config |
| GET | /api/optimization/cache/stats | JWT | Cache statistics |
| POST | /api/optimization/cache/clear | JWT | Clear cache |

Routing

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/routing/config | JWT | Get routing config |
| PUT | /api/routing/config | JWT | Update routing mode |
| GET | /api/routing/rules | JWT | List custom rules |
| POST | /api/routing/rules | JWT | Create rule |
| DELETE | /api/routing/rules/{id} | JWT | Delete rule |

Analytics

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/analytics/overview | JWT | Dashboard overview |
| GET | /api/analytics/savings | JWT | Savings time series |
| GET | /api/analytics/models | JWT | Model usage breakdown |
| GET | /api/analytics/levers | JWT | Savings by feature |

Requests

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/requests | JWT | List request logs |
| GET | /api/requests/{id} | JWT | Get request detail |

Team

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/team/members | JWT | List members |
| POST | /api/team/invite | JWT | Invite member |
| PUT | /api/team/members/{id} | JWT | Update role |
| DELETE | /api/team/members/{id} | JWT | Remove member |

Alerts

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/alerts/rules | JWT | List alert rules |
| POST | /api/alerts/rules | JWT | Create alert |
| PUT | /api/alerts/rules/{id} | JWT | Update alert |
| DELETE | /api/alerts/rules/{id} | JWT | Delete alert |

Organization

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/org | JWT | Get org details |
| PUT | /api/org | JWT | Update org |

Error Handling

Promptly returns standard HTTP error codes with structured error bodies:

json
{
  "detail": "No valid provider key found. Please connect an API key in the dashboard."
}

Common Errors

| Code | Cause |
|---|---|
| 401 Unauthorized | Missing, invalid, or inactive API key |
| 400 Bad Request | No provider key configured for the org |
| 409 Conflict | Email already registered |
| 429 Too Many Requests | Rate limit exceeded |
| 502 Bad Gateway | Upstream provider (OpenAI, etc.) returned an error |

Provider Failures + Fallback

If the routed model fails, Promptly automatically: (1) Falls back to the originally requested model, (2) Uses the original unoptimized prompt, (3) Retries once, and (4) Logs the event with status: "fallback".

Your app never sees a provider error unless both the optimized and fallback attempts fail.

Rate Limits

Promptly enforces per-key rate limiting:

| Limit | Default |
|---|---|
| Requests per minute | 500 |

When exceeded, you'll receive a 429 Too Many Requests response. Rate limits apply per Promptly API key (i.e., per organization).
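A common client-side pattern for 429s is exponential backoff. A minimal sketch (the `send` callable and `FakeResponse` class are placeholders; in a real app `send` would issue the proxied request):

```python
import time

def send_with_backoff(send, max_retries=4, base_delay=1.0):
    """Retry `send` on HTTP 429, doubling the delay each attempt."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s...
    raise RuntimeError("rate limited: retries exhausted")

class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

# Demo: the first two attempts are rate limited, the third succeeds.
attempts = iter([429, 429, 200])
result = send_with_backoff(lambda: FakeResponse(next(attempts)), base_delay=0)
print(result.status_code)  # 200
```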

FAQ

Does Promptly store my prompts or responses?

Promptly logs request metadata (model, tokens, cost, latency) for analytics. Cached responses are stored with a configurable TTL and can be cleared at any time. Provider API keys are encrypted at rest.

Can I use my Anthropic or Google key instead of OpenAI?

Yes. Promptly supports OpenAI, Anthropic, and Google. You connect whichever provider key you want in the dashboard. Your app always talks to Promptly in OpenAI format - Promptly translates to the correct provider format behind the scenes.

Does optimization change the quality of responses?

No. Optimization removes redundant tokens (extra whitespace, repeated instructions, verbose phrasing) without changing the semantic meaning. The LLM receives the same intent with fewer tokens.

What happens if I send a request without connecting a provider key?

You'll get a 400 Bad Request:

json
{ "detail": "No valid provider key found. Please connect an API key in the dashboard." }

Can I disable all optimization and use Promptly as a pure proxy?

Yes. Set level to "off" and routing_mode to "manual":

Disable everything
PUT /api/optimization/settings
{ "level": "off", "routing_mode": "manual", "cache_enabled": false }

How is my provider API key stored?

Provider keys are encrypted using industry-standard encryption. The raw key is never stored - only encrypted references.

What's the difference between Personal and Team accounts?

Both get the same proxy, optimization, and routing features. Team accounts add multi-user support with role-based access (admin/member/viewer), an invite flow, and shared analytics across team members.