Inference API
aeqi-inference is an OpenAI-compatible API for running inference on multiple LLM providers. It exposes chat completions, embeddings (stub), and model discovery endpoints.
Base URL
| Environment | URL |
|---|---|
| Hosted | https://app.aeqi.ai/v1 |
| Self-hosted | http://127.0.0.1:8443/v1 |
/v1/* is served by the platform binary alongside /api/*, not by a separate inference daemon.
Authentication
Hosted /v1/* requests use the same platform auth path as the dashboard and
REST API:
Authorization: Bearer <token>— a user JWT or an account API key (ak_...).X-Trust: <trust_id>— selects which TRUST runtime the request is scoped to.
Missing or invalid auth returns 401 Unauthorized. Missing X-Trust returns
400 Bad Request.
GET /v1/models is authenticated on the hosted platform because the /v1
router is protected as a group.
Lanes
Three planned payment lanes for inference:
| Lane | Status | How it bills |
|---|---|---|
| Subscription | Auth wired; balance enforcement staged | Pooled per-TRUST credit, debited on each call when the billing layer is mounted. |
| Treasury | Scaffolded | Direct USDC debit from the Company's on-chain treasury. |
| x402 | Planned | HTTP 402 + EIP-3009 USDC per request. |
The hosted mount currently enforces platform auth and TRUST selection. Balance debit and treasury/x402 payment rails are staged behind the inference billing layer and should not be treated as universally enforced by the current hosted router.
Endpoints
POST /v1/chat/completions
Chat completion endpoint with optional streaming.
Request:
{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "What is the capital of France?" }
],
"stream": false,
"temperature": 0.7,
"max_tokens": 1024
}
Request fields:
model(required) — one of Available Models.messages(required) — array of{ role, content }. Roles:"system","user","assistant","tool".stream(optional, defaultfalse) —truefor Server-Sent Events.temperature(optional, default1.0) —0.0–2.0.max_tokens(optional) — defaults to the model's default.
Non-streaming response (200):
{
"id": "cmpl-...",
"object": "chat.completion",
"created": 1714861200,
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "Paris is the capital of France." },
"finish_reason": "stop"
}
],
"usage": { "prompt_tokens": 45, "completion_tokens": 12, "total_tokens": 57 }
}
Streaming response (200):
Content-Type: text/event-stream. Each chunk uses object: "chat.completion.chunk":
data: {"id":"cmpl-...","object":"chat.completion.chunk","created":1714861200,"model":"meta-llama/Meta-Llama-3.1-70B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Paris"},"finish_reason":null}]}
data: {"id":"cmpl-...","object":"chat.completion.chunk","created":1714861200,"model":"meta-llama/Meta-Llama-3.1-70B-Instruct","choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}
data: [DONE]
Errors:
| Status | Code | Reason |
|---|---|---|
| 401 | auth_error |
Missing bearer header. |
| 402 | billing_error |
Reserved for billing enforcement when the inference billing layer is mounted. |
| 400 | invalid_request_error |
Model not in ALLOWED_MODELS, missing field, or malformed body. |
| 502 | upstream_error |
DeepInfra returned an error or unexpected payload. |
| 503 | upstream_error |
DeepInfra unreachable. |
| 500 | internal_error |
Internal aeqi-inference error. |
GET /v1/models
Authenticated on the hosted platform. List of currently-available models.
Response (200):
{
"object": "list",
"data": [
{ "id": "meta-llama/Meta-Llama-3.1-70B-Instruct", "object": "model", "created": 1746403200, "owned_by": "deepinfra" },
{ "id": "meta-llama/Meta-Llama-3.1-8B-Instruct", "object": "model", "created": 1746403200, "owned_by": "deepinfra" }
]
}
The endpoint also lists planned-but-unrouted models (e.g. gpt-5, claude-sonnet-4-6) with their advertised providers; calling chat-completions against them returns 400 until the corresponding upstream adapter ships.
POST /v1/embeddings
Embedding endpoint. Not yet implemented. Returns 501 Not Implemented. Reserved for Phase 2.
Available Models
The ALLOWED_MODELS whitelist in aeqi-inference/src/upstream/deepinfra.rs:
| Model | Provider | Input $/M | Output $/M |
|---|---|---|---|
meta-llama/Meta-Llama-3.1-70B-Instruct |
DeepInfra | $0.59 | $0.79 |
meta-llama/Meta-Llama-3.1-8B-Instruct |
DeepInfra | $0.055 | $0.055 |
meta-llama/Meta-Llama-3.3-70B-Instruct |
DeepInfra | $0.59 | $0.79 |
mistralai/Mistral-7B-Instruct-v0.3 |
DeepInfra | $0.055 | $0.055 |
Qwen/Qwen2.5-72B-Instruct |
DeepInfra | $0.35 | $0.40 |
deepinfra/airoboros-70b |
DeepInfra | $0.70 | $0.70 |
Prices are pulled from DeepInfra's public pricing page and snapshotted in pricing_for(). They are not refreshed automatically — when DeepInfra publishes a change, update both pricing_for() and this table together.
OpenAI, Anthropic, and DeepSeek provider adapters are stubs at this time.
Billing & Cost Accounting
The inference crate contains cost-accounting helpers and a balance-store interface. The hosted platform currently mounts the inference router behind auth; production balance enforcement is staged with the billing layer.
Pricing model:
- Intended billing is per million tokens (input and output separately).
compute_cost_microdollars(model, prompt_tokens, completion_tokens)produces the debit amount; it returns0for any model not inpricing_for().- When the billing layer is mounted, cost is debited on completion of the response. Mid-stream debits are not yet implemented.
Example:
- Model
meta-llama/Meta-Llama-3.1-70B-Instruct. - Input 1000 tokens at $0.59/M = $0.00059.
- Output 500 tokens at $0.79/M = $0.000395.
- Total ~$0.0009 (0.09¢).
402 Payment Required is reserved for the billing layer. Do not rely on it as a
current hosted-platform quota signal unless your deployment explicitly enables
inference balance enforcement.
To top up, use the dashboard. Programmatic USDC top-up (treasury lane) lands in Phase 2.
Examples
Non-streaming
const response = await fetch("https://app.aeqi.ai/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${token}`,
"X-Trust": trustId,
},
body: JSON.stringify({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct",
messages: [{ role: "user", content: "Explain quantum computing in two sentences." }],
stream: false,
}),
});
const data = await response.json();
console.log(data.choices[0].message.content);
Streaming
const response = await fetch("https://app.aeqi.ai/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${token}`,
"X-Trust": trustId,
},
body: JSON.stringify({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct",
messages: [{ role: "user", content: "Write a haiku about code." }],
stream: true,
}),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const text = decoder.decode(value);
for (const line of text.split("\n")) {
if (!line.startsWith("data: ")) continue;
const data = line.slice(6);
if (data === "[DONE]") return;
const chunk = JSON.parse(data);
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
}
List models
const response = await fetch("https://app.aeqi.ai/v1/models", {
headers: {
"Authorization": `Bearer ${token}`,
"X-Trust": trustId,
},
});
const models = await response.json();
models.data.forEach((m) => console.log(`${m.id} (by ${m.owned_by})`));
OpenAI Compatibility
POST /v1/chat/completions is wire-compatible with OpenAI's chat-completions API. Change the base URL and pass a bearer token:
from openai import OpenAI
client = OpenAI(
api_key="<bearer-token>",
base_url="https://app.aeqi.ai/v1",
default_headers={"X-Trust": "<trust_id>"},
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "<bearer-token>",
baseURL: "https://app.aeqi.ai/v1",
defaultHeaders: { "X-Trust": "<trust_id>" },
});
const response = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct",
messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
Phase 2 Roadmap
- Real JWT verification — replace the bearer-presence stub with full signature verification against the platform's
auth_secret. - Persistent balances — SQLite-backed
inference_balance_cents, LRU-cached. - Treasury lane — USDC debit from a Company's on-chain treasury.
- x402 lane — HTTP 402 + EIP-3009 for pay-per-request inference.
- Additional providers — OpenAI, Anthropic, DeepSeek (the model list already advertises them; only routing remains).
- Embeddings —
POST /v1/embeddings. - Mid-stream token accounting — true debit at stream-close based on actual tokens.