Developer Guide
Complete guide to integrating Promptly into your application. Promptly is an OpenAI-compatible proxy that optimizes every LLM request - reducing token usage, lowering costs, and improving response times - with nothing more than a base URL and API key change.
Quick Start
Get up and running in 3 minutes.
Step 1: Create an Account
curl -X POST https://api.getpromptly.in/api/auth/register \
-H "Content-Type: application/json" \
-d '{
"org_name": "My Company",
"email": "dev@mycompany.com",
"password": "securepassword",
"account_type": "team"
}'
Response:
{
"token": "eyJhbGciOiJIUz...",
"user": { "id": "uuid", "email": "dev@mycompany.com", "role": "admin" },
"org": { "id": "uuid", "name": "My Company", "account_type": "team" },
"api_key": "sk-promptly-abc123..."
}
Save your api_key - it's only shown once. This is your Promptly key.
Step 2: Connect Your Provider Key
Go to the dashboard → Keys → Add your OpenAI / Anthropic / Google API key. Or via API:
curl -X POST https://api.getpromptly.in/api/keys/providers \
-H "Authorization: Bearer <your-jwt-token>" \
-H "Content-Type: application/json" \
-d '{
"provider": "openai",
"api_key": "sk-your-openai-key"
}'
Promptly encrypts and stores your provider key securely.
Step 3: Use Promptly in Your App
import openai
client = openai.OpenAI(
api_key="sk-promptly-abc123...", # Your Promptly key
base_url="https://api.getpromptly.in/v1" # Point to Promptly
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is quantum computing?"}
]
)
print(response.choices[0].message.content)
How It Works
Your App                       Promptly                         LLM Provider
   │                              │                                  │
   │──── POST /v1/chat/... ──────▶│                                  │
   │     (Promptly API key)       │                                  │
   │                              │── 1. Validate API key            │
   │                              │── 2. Check semantic cache        │
   │                              │      (hit? return instant)       │
   │                              │── 3. Optimize prompt             │
   │                              │      • Whitespace removal        │
   │                              │      • Redundancy dedup          │
   │                              │      • System compression        │
   │                              │      • Context pruning           │
   │                              │── 4. Classify complexity         │
   │                              │      → Pick cheapest model       │
   │                              │── 5. Forward to provider ───────▶│
   │                              │      (your OpenAI key)           │──► LLM processes
   │                              │◀── 6. Receive response ──────────│
   │                              │── 7. Cache response              │
   │                              │── 8. Log everything              │
   │                              │── 9. Evaluate alerts             │
   │◀── Response + metadata ──────│                                  │
Authentication
Promptly uses two separate authentication systems:
Proxy Requests (your app → Promptly)
Use your Promptly API key as a Bearer token:
Authorization: Bearer sk-promptly-abc123...
This key authenticates proxy requests to /v1/* endpoints. It identifies your organization and loads your provider keys + optimization config.
Dashboard API (managing settings)
Use the JWT token returned at login/register:
Authorization: Bearer eyJhbGciOiJIUz...
This authenticates dashboard API requests (/api/* endpoints) - managing keys, viewing analytics, configuring optimization, etc.
OAuth Login
Promptly also supports OAuth via GitHub and Google. These are redirect-based flows:
# GitHub OAuth - redirects to GitHub authorization page
GET /api/auth/oauth/github
# GitHub callback - exchanges code for JWT
GET /api/auth/oauth/github/callback?code=xxx
# Google OAuth - redirects to Google authorization page
GET /api/auth/oauth/google
# Google callback - exchanges code for JWT
GET /api/auth/oauth/google/callback?code=xxx
After successful OAuth, you are redirected to the frontend with your JWT token.
Proxy Endpoint
POST /v1/chat/completions
The core endpoint. Fully compatible with the OpenAI Chat Completions API.
{
"model": "gpt-4o",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Explain recursion in 2 sentences." }
],
"temperature": 0.7,
"max_tokens": 500,
"stream": false
}
Supported Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | Requested model (may be overridden by routing) |
| messages | array | Array of message objects (role + content) |
| temperature | float | Sampling temperature (0-2) |
| max_tokens | int | Maximum tokens in response |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Frequency penalty (-2 to 2) |
| presence_penalty | float | Presence penalty (-2 to 2) |
| stop | string/array | Stop sequences |
| stream | bool | Enable SSE streaming |
| n | int | Number of completions |
| user | string | End-user identifier |
Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1709312000,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Recursion is when a function calls itself..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 28,
"completion_tokens": 45,
"total_tokens": 73
},
"promptly_metadata": {
"baseline_model": "gpt-4o",
"routed_model": "gpt-4o-mini",
"original_tokens": 52,
"optimized_tokens": 28,
"savings": 0.00034,
"cache_hit": false,
"latency_ms": 834
}
}
Notice promptly_metadata in the response - it tells you what Promptly did: the request asked for gpt-4o, but routing classified it as simple and used gpt-4o-mini. Prompt optimization reduced the prompt from 52 to 28 tokens (a 46% saving). Total cost savings: $0.00034.
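If you want to track savings client-side, the percentage above can be recomputed from promptly_metadata. A minimal sketch - the savings_percent helper is ours, not part of any SDK:

```python
def savings_percent(meta: dict) -> float:
    """Percent token reduction reported in promptly_metadata."""
    original = meta["original_tokens"]
    optimized = meta["optimized_tokens"]
    return round(100 * (original - optimized) / original, 1)

# Using the values from the sample response above:
meta = {"original_tokens": 52, "optimized_tokens": 28}
print(savings_percent(meta))  # → 46.2
```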
SDK Integration
Use our official SDK for the fastest setup, or change base_url in any OpenAI-compatible SDK.
Promptly Python SDK
pip install promptly-sdk
from promptly import Promptly
client = Promptly(api_key="sk-promptly-abc123...")
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Promptly Node.js SDK
npm install promptly-sdk
import Promptly from "promptly-sdk";
const client = new Promptly({ apiKey: "sk-promptly-abc123..." });
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
Wrap an Existing Client
Already have an OpenAI client? Wrap it to route through Promptly without changing any other code:
from promptly import wrap
from openai import OpenAI
client = wrap(OpenAI(api_key="sk-promptly-abc123..."))
# All existing code works unchanged
Alternative: Base URL method
You can also use any OpenAI-compatible SDK by changing base_url and api_key.
Python (openai SDK)
from openai import OpenAI
client = OpenAI(
api_key="sk-promptly-abc123...",
base_url="https://api.getpromptly.in/v1",
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello!"}],
)
Node.js (openai SDK)
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "sk-promptly-abc123...",
baseURL: "https://api.getpromptly.in/v1",
});
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello!" }],
});
cURL
curl https://api.getpromptly.in/v1/chat/completions \
-H "Authorization: Bearer sk-promptly-abc123..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
LangChain (Python)
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="gpt-4o",
api_key="sk-promptly-abc123...",
base_url="https://api.getpromptly.in/v1",
)
response = llm.invoke("Explain quantum computing.")
LlamaIndex
from llama_index.llms.openai import OpenAI
llm = OpenAI(
model="gpt-4o",
api_key="sk-promptly-abc123...",
api_base="https://api.getpromptly.in/v1",
)
Vercel AI SDK (TypeScript)
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";
const promptly = createOpenAI({
  baseURL: "https://api.getpromptly.in/v1",
  apiKey: "sk-promptly-abc123...",
});
const result = await generateText({
  model: promptly("gpt-4o"),
  prompt: "Explain recursion.",
});
Streaming
Promptly fully supports Server-Sent Events (SSE) streaming - the same format as OpenAI.
Python
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a poem about code."}],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)Node.js
const stream = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Write a poem about code." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Raw SSE Format
data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Optimization Settings
Control how aggressively Promptly optimizes your prompts.
Get Current Settings
GET /api/optimization/settings
Update Settings
PUT /api/optimization/settings
{
"level": "moderate",
"whitespace_removal": true,
"redundancy_elimination": true,
"system_prompt_compression": true,
"context_pruning": true,
"cache_enabled": true,
"cache_similarity_threshold": 0.95,
"cache_ttl_seconds": 3600,
"routing_mode": "auto"
}
Optimization Levels
| Level | What It Does | Expected Savings |
|---|---|---|
| conservative | Whitespace only, 20-message context window | 10-20% |
| moderate (default) | All optimizations, 10-message window + system dedup | 30-50% |
| aggressive | All optimizations, 6-message window + summary injection | 50-70% |
| off | No optimization - raw passthrough | 0% |
Individual Feature Toggles
| Feature | Default | What It Does |
|---|---|---|
| whitespace_removal | true | Normalizes whitespace, removes extra newlines/spaces |
| redundancy_elimination | true | Deduplicates repeated phrases and instructions |
| system_prompt_compression | true | Compresses verbose system prompts while keeping intent |
| context_pruning | true | Trims old conversation turns |
| cache_enabled | true | Enables semantic caching |
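As an illustration of what whitespace_removal does, here is a rough client-side approximation. This is a sketch of the idea, not Promptly's actual implementation:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs, cap blank lines, trim edges."""
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs → single space
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines → one blank line
    return "\n".join(line.strip() for line in text.split("\n")).strip()

prompt = "You are   a helpful\t assistant.\n\n\n\nAnswer briefly.  "
print(normalize_whitespace(prompt))
```

Fewer characters generally means fewer tokens, which is where the savings come from.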
Smart Routing
When routing_mode is set to "auto", Promptly analyzes each request and picks the cheapest model capable of handling it.
How Classification Works
Promptly uses AI-powered classification to analyze every request on a complexity scale. It considers factors like prompt length, intent, and content type to determine the optimal model tier:
| Input Pattern | Classification |
|---|---|
| Short, straightforward prompts | → simple |
| Moderate length with clear intent | → medium |
| Long prompts, code analysis, multi-step reasoning | → complex |
Model Selection by Provider
| Complexity | OpenAI | Anthropic | Google |
|---|---|---|---|
| simple | gpt-4o-mini | claude-haiku | gemini-2.0-flash |
| medium | gpt-4o | claude-sonnet | gemini-2.0-flash |
| complex | gpt-4o | claude-opus | gemini-2.0-pro |
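The table above amounts to a simple lookup. If you want to predict routing on the client, a hypothetical mirror of the server's table (which may change over time) looks like:

```python
# Hypothetical client-side mirror of Promptly's routing table above.
MODEL_BY_TIER = {
    "openai":    {"simple": "gpt-4o-mini",      "medium": "gpt-4o",           "complex": "gpt-4o"},
    "anthropic": {"simple": "claude-haiku",     "medium": "claude-sonnet",    "complex": "claude-opus"},
    "google":    {"simple": "gemini-2.0-flash", "medium": "gemini-2.0-flash", "complex": "gemini-2.0-pro"},
}

def pick_model(provider: str, complexity: str) -> str:
    """Return the model Promptly would route to for a given tier."""
    return MODEL_BY_TIER[provider][complexity]

print(pick_model("openai", "simple"))  # → gpt-4o-mini
```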
Routing Modes
| Mode | Behavior |
|---|---|
| auto | Classifies complexity → picks cheapest capable model |
| manual | Always uses the model specified in your request |
Fallback
If the routed model fails, Promptly automatically retries with the originally requested model and the unoptimized prompt. The event appears in your request logs with fallback_used: true.
Semantic Caching
When semantic caching is enabled, Promptly uses embedding-based similarity matching to detect when a user asks something similar to a previous request.
How It Works
Promptly uses a two-layer caching system:
1. Exact match - identical prompts return instantly.
2. Semantic similarity - near-duplicate prompts are matched via embedding comparison; if similarity ≥ the threshold (default: 0.95), the cached response is returned instantly.
On a cache miss, the response is stored after the LLM call.
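The two layers can be sketched roughly as follows. The lookup function and the toy two-dimensional embeddings are illustrative only, not Promptly internals:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(cache, prompt, embedding, threshold=0.95):
    # Layer 1: exact match on the prompt string.
    if prompt in cache["exact"]:
        return cache["exact"][prompt]
    # Layer 2: nearest cached embedding, accepted only above the threshold.
    best = max(cache["semantic"], key=lambda e: cosine(e["vec"], embedding), default=None)
    if best and cosine(best["vec"], embedding) >= threshold:
        return best["response"]
    return None  # miss → call the LLM, then cache the result

cache = {
    "exact": {"What's the capital of France?": "Paris."},
    "semantic": [{"vec": [1.0, 0.0], "response": "Paris."}],
}
print(lookup(cache, "capital of france?", [0.99, 0.05]))  # high similarity → "Paris."
```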
Cache Hit Behavior
Cost: $0 - no LLM call is made.
Latency: ~2-50ms - exact matches are near-instant; semantic lookups take slightly longer.
Each hit is logged as a cache hit in your analytics.
Configuration
| Setting | Default | Description |
|---|---|---|
| cache_enabled | true | Enable/disable caching |
| cache_similarity_threshold | 0.95 | Minimum cosine similarity (0.0 - 1.0). Lower = more hits, less precision |
| cache_ttl_seconds | 3600 | How long cached responses live (seconds) |
Cache Management
# View cache statistics
GET /api/optimization/cache/stats
# Response: { "exact_entries": 812, "semantic_entries": 435, "total_entries": 1247 }
# Clear all cached responses
POST /api/optimization/cache/clear
Examples
Request 1: "What's the capital of France?" → Miss → calls LLM → caches
Request 2: "capital of france?" → Hit (0.97 similarity) → $0
Request 3: "What is France's capital city?" → Hit (0.96 similarity) → $0
Request 4: "What's the capital of Germany?" → Miss (0.72 similarity) → calls LLMContext Pruning
Long conversations waste tokens on stale context. Context pruning trims old turns automatically based on your optimization level.
Behavior by Level
| Level | Window Size | System Dedup | Summary Injection |
|---|---|---|---|
| conservative | Last 20 messages | No | No |
| moderate | Last 10 messages | Yes - removes duplicate system messages | No |
| aggressive | Last 6 messages | Yes | Yes - injects a one-line summary of dropped messages |
Example
# 48-message conversation with aggressive pruning:
Before: 4,200 tokens (full history)
After: 980 tokens (6 latest messages + summary of dropped ones)
# Summary injected as system message:
"[Earlier context pruned: 8 user and 7 assistant messages were removed to save tokens.]"Context pruning runs only on the messages array. The most recent messages are always preserved. System messages are kept (but deduplicated on moderate/aggressive).
Custom Routing Rules
Override auto-routing with custom rules that are checked before classification.
Create a Rule
POST /api/routing/rules
{
"name": "Code tasks use GPT-4o",
"condition_type": "keyword",
"condition_value": { "keywords": ["code", "debug", "refactor"] },
"action": "use_model",
"action_value": "gpt-4o",
"priority": 10
}'
Condition Types
| Type | Value Format | Description |
|---|---|---|
| keyword | {"keywords": ["word1", "word2"]} | Matches if any keyword appears in the messages |
| token_count | {"operator": "gt", "value": 2000} | Matches on approximate token count (gt or lt) |
| always | {} | Matches every request (useful as a catch-all) |
Action Types
| Action | Value | Description |
|---|---|---|
| use_model | "gpt-4o-mini" | Route directly to a specific model |
| force_tier | "simple" | Force a complexity tier (simple/medium/complex) |
Examples
{
"name": "Short prompts → mini",
"condition_type": "token_count",
"condition_value": { "operator": "lt", "value": 50 },
"action": "use_model",
"action_value": "gpt-4o-mini",
"priority": 20
}
{
"name": "Default everything to Sonnet",
"condition_type": "always",
"condition_value": {},
"action": "use_model",
"action_value": "claude-sonnet",
"priority": 0
}
List / Delete Rules
# List all rules
GET /api/routing/rules
# Delete a rule
DELETE /api/routing/rules/{rule_id}
Alerts
Set up alerts when metrics cross thresholds.
Create an Alert Rule
POST /api/alerts/rules
{
"name": "High daily spend",
"metric": "daily_spend",
"operator": "gt",
"threshold": 50.00,
"channels": ["email"]
}
Available Metrics
| Metric | Description |
|---|---|
| daily_spend | Total cost for the day ($) |
| error_rate | Percentage of failed requests |
| latency_p95 | 95th percentile latency (ms) |
| savings_percentage | Optimization savings rate (%) |
Operators
| Operator | Description |
|---|---|
| gt | Greater than |
| lt | Less than |
| gte | Greater than or equal |
| lte | Less than or equal |
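The evaluation semantics are straightforward - a sketch of how an alert rule maps to a comparison (alert_triggered is illustrative, not the server's code):

```python
import operator as op

OPS = {"gt": op.gt, "lt": op.lt, "gte": op.ge, "lte": op.le}

def alert_triggered(rule, current_value):
    """True when the current metric value crosses the rule's threshold."""
    return OPS[rule["operator"]](current_value, rule["threshold"])

rule = {"metric": "daily_spend", "operator": "gt", "threshold": 50.00}
print(alert_triggered(rule, 62.40))  # → True
```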
Manage Alerts
# List all alerts
GET /api/alerts/rules
# Update an alert
PUT /api/alerts/rules/{rule_id}
{ "threshold": 100.00, "is_active": true }
# Delete an alert
DELETE /api/alerts/rules/{rule_id}
Alerts are evaluated automatically in the background on a recurring schedule.
Analytics & Request Logs
Dashboard Overview
GET /api/analytics/overview
Returns total requests, costs, savings, cache hit rate, and other aggregate stats.
Savings Over Time
GET /api/analytics/savings
Model Distribution
GET /api/analytics/models
Savings by Feature
GET /api/analytics/levers
Breaks down savings by optimization feature (whitespace, redundancy, compression, caching, routing).
Request Logs
Every proxy request is logged with full detail:
# List requests (paginated)
GET /api/requests
# Get a specific request
GET /api/requests/{request_id}
Each log entry includes: baseline model vs. routed model, original tokens vs. optimized tokens, cost breakdown and savings, cache hit status, optimizations applied, latency (total, proxy overhead, provider time), and success / fallback / error status.
Team Management
Team accounts (account_type: "team") can manage multiple users.
Invite a Member
POST /api/team/invite
{
"email": "teammate@company.com",
"role": "viewer"
}
Roles
| Role | Permissions |
|---|---|
| admin | Full access - manage keys, settings, team, billing |
| member | Standard access - manage own key, view analytics, make requests |
| viewer | Read-only - view dashboard, analytics, request logs |
Manage Members
# List team members
GET /api/team/members
# Update a member's role
PUT /api/team/members/{user_id}
{ "role": "admin" }
# Remove a member
DELETE /api/team/members/{user_id}
All team members operate under the same organization - sharing API keys, analytics, and settings.
API Reference
Base URLs
| Environment | URL |
|---|---|
| Production | https://api.getpromptly.in |
| Local dev | http://localhost:8000 |
Proxy
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /v1/chat/completions | API Key | Chat completions (OpenAI-compatible) |
| POST | /v1/completions | API Key | Legacy endpoint (redirects to chat) |
Auth
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /api/auth/register | None | Register org → returns JWT + API key |
| POST | /api/auth/login | None | Login → returns JWT |
| GET | /api/auth/oauth/github | None | GitHub OAuth redirect |
| GET | /api/auth/oauth/google | None | Google OAuth redirect |
Keys
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/keys/platform | JWT | List Promptly API keys |
| POST | /api/keys/platform/regenerate | JWT | Regenerate API key |
| GET | /api/keys/providers | JWT | List provider keys |
| POST | /api/keys/providers | JWT | Add provider key |
| PUT | /api/keys/providers/{id} | JWT | Update provider key |
| DELETE | /api/keys/providers/{id} | JWT | Delete provider key |
Optimization
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/optimization/settings | JWT | Get optimization config |
| PUT | /api/optimization/settings | JWT | Update optimization config |
| GET | /api/optimization/cache/stats | JWT | Cache statistics |
| POST | /api/optimization/cache/clear | JWT | Clear cache |
Routing
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/routing/config | JWT | Get routing config |
| PUT | /api/routing/config | JWT | Update routing mode |
| GET | /api/routing/rules | JWT | List custom rules |
| POST | /api/routing/rules | JWT | Create rule |
| DELETE | /api/routing/rules/{id} | JWT | Delete rule |
Analytics
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/analytics/overview | JWT | Dashboard overview |
| GET | /api/analytics/savings | JWT | Savings time series |
| GET | /api/analytics/models | JWT | Model usage breakdown |
| GET | /api/analytics/levers | JWT | Savings by feature |
Requests
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/requests | JWT | List request logs |
| GET | /api/requests/{id} | JWT | Get request detail |
Team
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/team/members | JWT | List members |
| POST | /api/team/invite | JWT | Invite member |
| PUT | /api/team/members/{id} | JWT | Update role |
| DELETE | /api/team/members/{id} | JWT | Remove member |
Alerts
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/alerts/rules | JWT | List alert rules |
| POST | /api/alerts/rules | JWT | Create alert |
| PUT | /api/alerts/rules/{id} | JWT | Update alert |
| DELETE | /api/alerts/rules/{id} | JWT | Delete alert |
Organization
| Method | Endpoint | Auth | Description |
|---|---|---|---|
| GET | /api/org | JWT | Get org details |
| PUT | /api/org | JWT | Update org |
Error Handling
Promptly returns standard HTTP error codes with structured error bodies:
{
"detail": "No valid provider key found. Please connect an API key in the dashboard."
}
Common Errors
| Code | Cause |
|---|---|
| 401 Unauthorized | Missing, invalid, or inactive API key |
| 400 Bad Request | No provider key configured for the org |
| 409 Conflict | Email already registered |
| 429 Too Many Requests | Rate limit exceeded |
| 502 Bad Gateway | Upstream provider (OpenAI, etc.) returned an error |
Provider Failures + Fallback
If the routed model fails, Promptly automatically: (1) Falls back to the originally requested model, (2) Uses the original unoptimized prompt, (3) Retries once, and (4) Logs the event with status: "fallback".
Rate Limits
Promptly enforces per-key rate limiting:
| Limit | Default |
|---|---|
| Requests per minute | 500 |
When exceeded, you'll receive a 429 Too Many Requests response. Rate limits apply per Promptly API key (i.e., per organization).
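When you do hit the limit, back off and retry rather than hammering the endpoint. A minimal client-side sketch - the RateLimited exception is a placeholder for whatever your HTTP layer raises on a 429:

```python
import random
import time

class RateLimited(Exception):
    """Placeholder: raised by your HTTP layer on a 429 response."""

def with_backoff(call, max_retries=5, sleep=time.sleep):
    """Retry `call` with exponential backoff + jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            sleep(min(2 ** attempt + random.random(), 30.0))
```

The sleep parameter exists mainly so tests can stub out real waiting; in production, leave it at the default.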
FAQ
Does Promptly store my prompts or responses?
Promptly logs request metadata (model, tokens, cost, latency) for analytics. Cached responses are stored with a configurable TTL and can be cleared at any time. Provider API keys are encrypted at rest.
Can I use my Anthropic or Google key instead of OpenAI?
Yes. Promptly supports OpenAI, Anthropic, and Google. You connect whichever provider key you want in the dashboard. Your app always talks to Promptly in OpenAI format - Promptly translates to the correct provider format behind the scenes.
Does optimization change the quality of responses?
No. Optimization removes redundant tokens (extra whitespace, repeated instructions, verbose phrasing) without changing the semantic meaning. The LLM receives the same intent with fewer tokens.
What happens if I send a request without connecting a provider key?
You'll get a 400 Bad Request:
{ "detail": "No valid provider key found. Please connect an API key in the dashboard." }Can I disable all optimization and use Promptly as a pure proxy?
Yes. Set level to "off" and routing_mode to "manual":
PUT /api/optimization/settings
{ "level": "off", "routing_mode": "manual", "cache_enabled": false }How is my provider API key stored?
Provider keys are encrypted using industry-standard encryption. The raw key is never stored - only encrypted references.
What's the difference between Personal and Team accounts?
Both get the same proxy, optimization, and routing features. Team accounts add multi-user support with role-based access (admin/member/viewer), an invite flow, and shared analytics across team members.