The easiest way to get started is with the interactive setup script. It will configure a terminal-based monitoring tool, download your first model, and let you test the engine's capabilities via benchmarks or the interactive chat shell.
# Make the script executable
chmod +x setup.sh
# Run the interactive setup (recommended)
./setup.sh

Run the runtime comparison (LM Studio vs. Bodega Inference Engine) with leaderboard upload:
# Ensure the same model is loaded in LM Studio with Max Concurrent Predictions = 32
python compare_engines.py --model srswti/bodega-orion-0.6b \
--lmstudio-model-id bodega-orion-0.6b \
--output results/compare_$(date +%Y%m%d_%H%M%S).json \
--leaderboard-url https://leaderboard.srswti.com

Use --lmstudio-model-id to match the model ID shown in LM Studio (often the short name, e.g. bodega-orion-0.6b). Results are posted to the global leaderboard.
Bodega Inference Engine delivers enterprise-grade inference directly on your machine. Built specifically for Apple Silicon, it provides a seamless runtime with an OpenAI-compatible API that is faster, more memory-efficient, and more intuitive than other local runtimes.
Architecture: Multi-process isolated handler architecture prevents Metal memory leaks.
As of the latest release, Bodega is a multi-model registry — you can load, route to, and unload multiple models simultaneously, each running in its own hardware-isolated subprocess. The engine automatically handles resource allocation and delivers the fastest possible inference on Apple Silicon.
Key Capabilities:
- Multi-model registry with dynamic loading and unloading
- Language model inference with streaming support
- Multimodal language model support (vision)
- Image generation (live next week, week of March 17)
- Image editing (live next week, week of March 17)
- Structured output via JSON schema constraints
- Speculative decoding for accelerated generation
- Continuous batching for high-throughput workloads
- Built-in prompt caching
- Getting Started
- Core Endpoints
- Model Management
- Model Discovery
- Advanced Features
- Monitoring and Health
- Best Practices
Start the server and load your first model:
# Or dynamically load a model via API
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_id": "bodega-raptor-8b",
"model_type": "lm",
"context_length": 32768,
"prompt_cache_size": 10
}'
# Make your first inference request
curl -X POST http://localhost:44468/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "bodega-raptor-8b",
"messages": [
{"role": "user", "content": "Hello, welcome to the world of dreamers?"}
]
}'

The standard way to run multiple models is to keep calling the /v1/admin/load-model endpoint — each call spawns a new isolated subprocess for that model. You can check which models are running and their memory usage at any time via /health:
# Load a language model
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "bodega-orion-0.6b",
"model_type": "lm",
"model_path": "srswti/bodega-orion-0.6b"
}'
# Load a multimodal model alongside it
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "srswti/bodega-solomon-9b",
"model_type": "multimodal",
"model_path": "srswti/bodega-solomon-9b"
}'
# Load our favourite model alongside it :)
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "blackbird",
"model_type": "lm",
"model_path": "srswti/blackbird-she-doesnt-refuse-21b",
"context_length": 32768,
"max_concurrency": 1,
"reasoning_parser": "harmony",
"tool_call_parser": "harmony"
}'
# Check what's running
curl http://localhost:44468/health

{
"status": "ok",
"model_id": "bodega-raptor-0.9b, srswti/bodega-solomon-9b",
"model_status": "initialized (2 model(s))",
"models_detail": [
{
"id": "bodega-raptor-0.9b",
"type": "lm",
"status": "running",
"ram_usage_mb": 3667.1
},
{
"id": "srswti/bodega-solomon-9b",
"type": "multimodal",
"status": "running",
"ram_usage_mb": 13344.5
}
]
}

You can also load any HuggingFace model directly — not just SRSWTI models. For example, loading a community Qwen model with continuous batching:
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "Qwen/Qwen3-30B-A3B-MLX-4bit",
"model_type": "lm",
"model_path": "Qwen/Qwen3-30B-A3B-MLX-4bit",
"max_concurrency": 1,
"queue_timeout": 300,
"queue_size": 100,
"continuous_batching": true,
"cb_max_num_seqs": 256,
"cb_prefill_batch_size": 16,
"cb_completion_batch_size": 32
}' | python3 -m json.tool

{
"status": "loaded",
"model_id": "Qwen/Qwen3-30B-A3B-MLX-4bit",
"model_path": "Qwen/Qwen3-30B-A3B-MLX-4bit",
"model_type": "lm"
}

Note:
config.yaml support for launching multiple models at server start is currently an experimental release available to a limited set of users. General availability is coming soon.
Example config.yaml:
server:
  host: "0.0.0.0"
  port: 44468

models:
  - model_id: "bodega-solomon-9b"
    model_type: "multimodal"
    model_path: "srswti/bodega-solomon-9b"
    max_concurrency: 1
  - model_id: "bodega-raptor-8b"
    model_type: "lm"
    model_path: "srswti/bodega-raptor-8b-mxfp4"
    prompt_cache_size: 10

import requests
BASE_URL = "http://localhost:44468"
# Load a model
response = requests.post(
f"{BASE_URL}/v1/admin/load-model",
json={
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_id": "bodega-raptor-8b",
"model_type": "lm",
"context_length": 32768
}
)
print(response.json())
# Chat completion
response = requests.post(
f"{BASE_URL}/v1/chat/completions",
json={
"model": "bodega-raptor-8b",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"max_tokens": 500,
"temperature": 0.7
}
)
print(response.json()["choices"][0]["message"]["content"])

Generate text responses using loaded language models. Fully compatible with OpenAI's chat completions API.
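Since the endpoint follows the OpenAI chat completions protocol, the official openai Python client should also work when pointed at the local server. A minimal sketch, assuming the openai package is installed, a model is loaded under the alias bodega-raptor-8b, and the default port 44468:

```python
from openai import OpenAI

# The api_key is a placeholder — the local server does not authenticate requests.
client = OpenAI(base_url="http://localhost:44468/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="bodega-raptor-8b",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```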
Endpoint: POST /v1/chat/completions
curl -X POST http://localhost:44468/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "bodega-raptor-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
"max_tokens": 1000,
"temperature": 0.7
}'

curl -X POST http://localhost:44468/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "bodega-raptor-8b",
"messages": [
{"role": "user", "content": "Write a short story about AI."}
],
"stream": true
}'

import requests
import json
response = requests.post(
"http://localhost:44468/v1/chat/completions",
json={
"model": "bodega-raptor-8b",
"messages": [
{"role": "user", "content": "Write a short story about AI."}
],
"stream": True
},
stream=True
)
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]
            if data != '[DONE]':
                chunk = json.loads(data)
                content = chunk["choices"][0]["delta"].get("content", "")
                if content:
                    print(content, end="", flush=True)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model identifier |
| `messages` | array | required | Array of message objects with role and content |
| `max_tokens` | integer | null | Maximum tokens to generate |
| `temperature` | float | 0.7 | Sampling temperature (0.0 to 2.0) |
| `top_p` | float | 1.0 | Nucleus sampling parameter |
| `stream` | boolean | false | Enable streaming responses |
| `tools` | array | null | Available tools for function calling |
| `tool_choice` | string/object | "auto" | Control tool selection behavior |
| `response_format` | object | null | Specify output format (e.g., JSON schema) |
| `presence_penalty` | float | 0.0 | Penalize new tokens based on presence |
| `frequency_penalty` | float | 0.0 | Penalize new tokens based on frequency |
| `stop` | string/array | null | Stop sequences |
| `seed` | integer | null | Random seed for reproducibility |
{
"id": "chatcmpl_1234567890",
"object": "chat.completion",
"created": 1677652288,
"model": "bodega-raptor-8b",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Machine learning is a subset of artificial intelligence..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 120,
"total_tokens": 135
}
}

Force the model to output data that strictly adheres to a predefined JSON schema. Constraints are applied natively within the inference engine using outlines.
Endpoint: POST /v1/chat/completions
import requests
schema = {
"type": "json_schema",
"json_schema": {
"name": "AddressExtractor",
"schema": {
"type": "object",
"properties": {
"address": {
"type": "object",
"properties": {
"street": {"type": "string"},
"city": {"type": "string"},
"state": {"type": "string", "description": "2 letter abbreviation"},
"zip": {"type": "string", "description": "5 digit zip code"}
},
"required": ["street", "city", "state", "zip"]
}
},
"required": ["address"]
}
}
}
response = requests.post(
"http://localhost:44468/v1/chat/completions",
json={
"model": "bodega-raptor-8b",
"messages": [
{"role": "system", "content": "Extract the address from the user input into the specified JSON format."},
{"role": "user", "content": "Please format this address: 1 Hacker Wy Menlo Park CA 94025"}
],
"response_format": schema,
"stream": False
}
)
# Returns: '{"address": {"street": "1 Hacker Wy", "city": "Menlo Park", "state": "CA", "zip": "94025"}}'
print(response.json()["choices"][0]["message"]["content"])

Structured output also works with "stream": true — the model will stream partial JSON tokens as they are generated.
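For reference, a sketch of consuming a streamed structured-output response with the requests library, reusing the schema object defined above (assumes the same bodega-raptor-8b alias):

```python
import json
import requests

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [
            {"role": "user", "content": "Format this address: 1 Hacker Wy Menlo Park CA 94025"}
        ],
        "response_format": schema,  # the JSON-schema response_format object defined above
        "stream": True,
    },
    stream=True,
)

buffer = ""
for line in response.iter_lines():
    if line and line.startswith(b"data: ") and line[6:] != b"[DONE]":
        chunk = json.loads(line[6:])
        buffer += chunk["choices"][0]["delta"].get("content", "")

# Once the stream ends, the accumulated text is a complete JSON document.
print(json.loads(buffer))
```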
Pass images alongside text prompts for models with vision capabilities such as bodega-solomon-9b.
Endpoint: POST /v1/chat/completions
curl -X POST http://localhost:44468/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "srswti/bodega-solomon-9b",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image? Provide a detailed description."
},
{
"type": "image_url",
"image_url": {
"url": "https://weblog.spots.ag/08-2018/aventador_svj_official/th.jpg"
}
}
]
}
],
"max_tokens": 300
}'
import base64
import requests
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
base64_image = encode_image("document_scan.png")
response = requests.post(
"http://localhost:44468/v1/chat/completions",
json={
"model": "bodega-solomon-9b",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Extract all text from this scanned document."},
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{base64_image}"}
}
]
}
]
}
)
print(response.json()["choices"][0]["message"]["content"])

Generate images from text prompts using locally running image models.
Coming the week of March 17. It is currently an experimental release.
Endpoint: POST /v1/images/generations
First, load an image generation model using one of the available config_name values:
# Solomon — fast, lightweight generation (recommended starting point)
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "solomon",
"model_type": "image-generation",
"config_name": "solomon"
}'
# Keshav — turbo generation, extremely fast
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "keshav",
"model_type": "image-generation",
"config_name": "keshav"
}'

Then generate:
curl -X POST http://localhost:44468/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "solomon",
"prompt": "A highly detailed portrait of a tiny red dragon wearing a chef hat, pulling a fresh loaf of sourdough bread out of a medieval stone oven.",
"size": "1024x1024",
"guidance_scale": 3.5,
"steps": 14,
"seed": 42
}'

Available config_name values for image generation: solomon, solomon-max, rehoboam, omri-4b, omri-9b, keshav, kalamkari, fibo.
The engine returns a standard OpenAI-compatible object with a b64_json image payload:
{
"created": 1709428581,
"data": [
{
"b64_json": "iVBORw0KGgoAAAANSUhEUgAA..."
}
]
}

Edit existing images with text instructions using srswti/keshav or srswti/kalamkari.
Coming week of March 17.
Endpoint: POST /v1/images/edits
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "kalamkari-edit",
"model_type": "image-edit",
"config_name": "qwen-image-edit"
}'

Bodega is a multi-model registry. You can dynamically spawn, route to, and unload process-isolated model handlers without ever restarting the server.
Spawn a new handler process for a model. It becomes immediately available for inference requests.
Endpoint: POST /v1/admin/load-model
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_id": "bodega-raptor-8b",
"model_type": "lm",
"context_length": 32768,
"max_concurrency": 1,
"prompt_cache_size": 10
}'

curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "srswti/solomon",
"model_type": "image-generation",
"config_name": "solomon",
"quantize": 8
}'

import requests
response = requests.post(
"http://localhost:44468/v1/admin/load-model",
json={
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_type": "lm",
"context_length": 32768,
"max_concurrency": 1,
"reasoning_parser": "qwen3",
"tool_call_parser": "qwen3"
}
)
print(response.json())

When loading any model from HuggingFace — not just SRSWTI models — use the HuggingFace model card to determine the right model_type. The two most common cases:
text-generation on HuggingFace → model_type: "lm"
These are standard language models that take text in and produce text out. Any model whose HuggingFace page lists the pipeline tag as text-generation should be loaded with "lm".
# Example: a community Qwen text generation model
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "qwen3-8b",
"model_type": "lm",
"model_path": "mlx-community/Qwen3-8B-4bit",
"context_length": 32768
}'

image-text-to-text on HuggingFace → model_type: "multimodal"
These are vision-language models that accept both images and text as input. Any model whose HuggingFace page lists the pipeline tag as image-text-to-text should be loaded with "multimodal". This applies to models like Qwen-VL, LLaVA, InternVL, and others.
# Example: a community vision model
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_id": "qwen3.5-27b-vl",
"model_type": "multimodal",
"model_path": "mlx-community/Qwen3.5-27B-4bit",
"context_length": 16384
}'

Once loaded as multimodal, you can pass images in the standard image_url content block format just like with bodega-solomon-9b.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_path` | string | required | HuggingFace repo ID or local path |
| `model_id` | string | null | Alias used in API requests (defaults to path) |
| `model_type` | string | "lm" | Model type: lm, multimodal, image-generation, image-edit |
| `context_length` | integer | 32768 | Maximum context length |
| `max_concurrency` | integer | 1 | Maximum concurrent requests |
| `queue_timeout` | integer | 300 | Request timeout in seconds |
| `queue_size` | integer | 100 | Maximum queue size |
| `quantize` | integer | 8 | Quantization level for Flux models (4, 8, or 16) |
| `config_name` | string | null | Config for image generation: solomon, solomon-max, rehoboam, omri-4b, omri-9b, keshav, kalamkari, fibo. For editing: flux-kontext-dev, flux2-klein-edit-4b, flux2-klein-edit-9b, qwen-image-edit |
| `lora_paths` | array | null | Paths to LoRA adapters |
| `lora_scales` | array | null | Scale factors for LoRA adapters |
| `disable_auto_resize` | boolean | false | Disable auto-resize for vision models |
| `enable_auto_tool_choice` | boolean | false | Enable automatic tool selection |
| `tool_call_parser` | string | null | Parser for tool calls (qwen3, harmony, etc.) |
| `reasoning_parser` | string | null | Parser for reasoning content (qwen3, harmony, etc.) |
| `trust_remote_code` | boolean | false | Allow custom model code execution |
| `chat_template_file` | string | null | Path to custom chat template |
| `continuous_batching` | boolean | false | Enable high-throughput continuous batching |
| `cb_max_num_seqs` | integer | 256 | Max sequences in the batching engine |
| `cb_prefill_batch_size` | integer | 8 | Concurrency limit for prompt ingestion |
| `cb_completion_batch_size` | integer | 32 | Generation concurrency limit on GPU |
| `cb_chunked_prefill_tokens` | integer | 2048 | Token chunk size for large prompts |
| `cb_enable_prefix_cache` | boolean | true | Enable block-aware prompt caching |
| `draft_model_path` | string | null | Path to draft model for speculative decoding |
| `num_draft_tokens` | integer | null | Number of tokens for the draft model to guess |
| `prompt_cache_size` | integer | 0 | Number of prompt cache slots |
Both tool_call_parser and reasoning_parser support: qwen3, glm4_moe, qwen3_coder, qwen3_moe, qwen3_next, qwen3_vl, harmony, minimax_m2.
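Once a model is loaded with a tool_call_parser, function calling can be exercised through the standard OpenAI-style tools and tool_choice parameters listed in the chat completions table earlier. A minimal sketch, assuming a model loaded under the alias bodega-raptor-8b with a compatible parser; the get_weather tool is hypothetical and the exact tool-call payload depends on the model:

```python
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
        "tools": tools,
        "tool_choice": "auto",
    },
)

message = response.json()["choices"][0]["message"]
# When the parser detects a call, it is assumed to surface as OpenAI-style tool_calls.
for call in message.get("tool_calls") or []:
    print(call["function"]["name"], call["function"]["arguments"])
```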
Gracefully shut down a model's subprocess and unregister it from the engine, instantly freeing its unified GPU/CPU memory. The rest of your loaded models continue running uninterrupted.
Endpoint: DELETE /v1/admin/unload-model/{model_id}
# Unload by model_id
curl -X DELETE http://localhost:44468/v1/admin/unload-model/bodega-raptor-0.9b
# Works with full path model IDs too
curl -X DELETE http://localhost:44468/v1/admin/unload-model/srswti/bodega-orion-0.6b

response = requests.delete("http://localhost:44468/v1/admin/unload-model/bodega-raptor-0.9b")
print(response.json())

Remove a model from your local HuggingFace cache to free disk space. The model must be unloaded first if it is currently running.
Endpoint: DELETE /v1/models/local/{model_id}
# Delete a locally cached model
curl -X DELETE "http://localhost:44468/v1/models/local/mlx-community/Qwen3.5-27B-4bit"{"id": "mlx-community/Qwen3.5-27B-4bit", "object": "model", "deleted": true}model_id = "SRSWTI/bodega-raptor-8b-mxfp4"
response = requests.delete(f"http://localhost:44468/v1/models/local/{model_id}")
print(response.json())

Retrieve real-time Metal Unified Memory and CPU RSS metrics for all running models.
Endpoint: GET /v1/admin/loaded-models
curl http://localhost:44468/v1/admin/loaded-models

response = requests.get("http://localhost:44468/v1/admin/loaded-models")
models = response.json().get("data", [])
for model in models:
    print(f"[{model['status'].upper()}] {model['id']} — PID: {model['pid']}")
    mem = model.get('memory', {})
    print(f"  └ Metal Active (GPU): {mem.get('metal_active_mb', 0):.1f} MB")
    print(f"  └ Process RSS overhead (CPU): {mem.get('rss_mb', 0):.1f} MB")
    print(f"  └ Total System Pool: {mem.get('total_mb', 0):.1f} MB\n")

Response Format:
{
"object": "list",
"data": [
{
"id": "bodega-raptor-8b",
"type": "lm",
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"context_length": 32768,
"created_at": 1704067200,
"status": "running",
"pid": 83932,
"memory": {
"metal_active_mb": 4150.2,
"metal_cache_mb": 0.0,
"metal_peak_mb": 4150.2,
"rss_mb": 408.2,
"total_mb": 4558.4
}
}
],
"total": 1
}

Discover, download, and manage models from HuggingFace.
List all models in your local HuggingFace cache.
Endpoint: GET /v1/models
curl http://localhost:44468/v1/models
# Verify download completeness against HuggingFace API
curl "http://localhost:44468/v1/models?verify_with_hub=true"Response Format:
{
"object": "list",
"data": [
{
"id": "SRSWTI/bodega-raptor-8b-mxfp4",
"object": "model",
"created": 1704067200,
"owned_by": "SRSWTI",
"size_gb": 4.8,
"download_percentage": 100.0,
"is_complete": true
}
]
}

Download a model to your local cache.
Endpoint: POST /v1/admin/download-model
curl -X POST http://localhost:44468/v1/admin/download-model \
-H "Content-Type: application/json" \
-d '{"model_path": "SRSWTI/bodega-raptor-8b-mxfp4"}'Download with real-time progress via Server-Sent Events.
Endpoint: POST /v1/admin/download-model-stream
import requests, json
response = requests.post(
"http://localhost:44468/v1/admin/download-model-stream",
json={"model_path": "SRSWTI/bodega-raptor-8b-mxfp4"},
stream=True
)
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: ') and line[6:] != '[DONE]':
            progress = json.loads(line[6:])
            print(f"{progress['status']} — {progress.get('progress', 0)}%")
            if 'current_file' in progress:
                print(f"  File: {progress['current_file']}")

Some models support an explicit reasoning/thinking process. Configure a parser to extract it.
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_type": "lm",
"reasoning_parser": "qwen3"
}'response = requests.post(
"http://localhost:44468/v1/chat/completions",
json={
"model": "bodega-raptor-8b",
"messages": [{"role": "user", "content": "Solve this logic puzzle: ..."}],
"chat_template_kwargs": {"enable_thinking": True}
}
)
message = response.json()["choices"][0]["message"]
if "reasoning_content" in message:
print("Thinking:", message["reasoning_content"])
print("Answer:", message["content"])Force the model to output valid JSON.
response = requests.post(
"http://localhost:44468/v1/chat/completions",
json={
"model": "bodega-raptor-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant that outputs JSON."},
{"role": "user", "content": "List three colors with their hex codes."}
],
"response_format": {"type": "json_object"}
}
)
import json
result = json.loads(response.json()["choices"][0]["message"]["content"])
print(result)

Bodega uses dynamic prompt caching for extremely fast time-to-first-token on recurring sequences. The cache operates natively on MLX token indices — overlapping prefixes across subsequent calls bypass matrix multiplication completely.
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_type": "lm",
"prompt_cache_size": 25
}'

Speculative decoding significantly accelerates generation for large models — especially in single-user, latency-sensitive workloads — without any change to output quality or the response format you receive.
On Apple Silicon, text generation is memory-bandwidth-bound, not compute-bound. For every single token a large model generates, the GPU must load the full set of model weights from unified memory into the compute cores. An 8B-parameter model at 4-bit quantization is roughly 4–5GB. Loading those weights once to produce a single token means the vast majority of each generation step is spent on memory transfer, not math. This is why scaling up GPU cores doesn't help much — you're waiting on the memory bus, not the ALUs.
Instead of running the large target model once per token, the engine runs two models in parallel:
- Draft model — a small, fast model (e.g. 0.6B params) that guesses the next N tokens very quickly. Because it's tiny, this costs almost nothing.
- Target model — the large model you actually want responses from. Instead of generating one token at a time, it evaluates all N draft guesses in a single forward pass using parallel matrix multiplication.
If the target model agrees with the draft's guesses, all N tokens are accepted at once. You get N tokens for the memory-load cost of one. When the target disagrees at position k, it accepts tokens 0 through k-1 and corrects at k, and the draft restarts from there.
In practice, a well-matched draft model (same tokenizer family, same training distribution) agrees on the majority of guesses, yielding effective speedups of 2–3x on generation-heavy workloads without touching output quality. The output is mathematically identical to what the target model would have generated on its own.
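To make the accept/verify loop concrete, here is a deliberately simplified sketch of the greedy case described above — illustration only, not the engine's actual implementation; draft_generate and target_forward are hypothetical stand-ins for the two models:

```python
def speculative_step(prompt_tokens: list[int], n_draft: int) -> list[int]:
    # 1. The cheap draft model guesses the next n_draft tokens.
    draft = draft_generate(prompt_tokens, n=n_draft)        # hypothetical helper

    # 2. The target model verifies all guesses in a single forward pass,
    #    paying the weight-load cost once instead of n_draft times.
    target = target_forward(prompt_tokens, draft)            # hypothetical helper

    accepted = []
    for k in range(n_draft):
        if target[k] == draft[k]:
            accepted.append(draft[k])      # agreement: keep the draft token
        else:
            accepted.append(target[k])     # disagreement: take the target's token
            break                          # and restart drafting from here
    return accepted
```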
The draft model must share the same tokenizer as the target model. Using a model from a different family (e.g. a Llama draft with a Qwen target) will produce garbage. Use a smaller variant from the same model family — for example, a 0.6B or 1B Qwen3 variant to accelerate an 8B or 32B Qwen3 target.
Note: Speculative decoding and continuous batching cannot be used simultaneously. Speculative decoding is optimal for single-user latency; continuous batching is optimal for multi-user, high-concurrency throughput. Choose based on your workload.
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_type": "lm",
"draft_model_path": "Qwen/Qwen3-0.6B-MLX-4bit",
"num_draft_tokens": 4
}'

Or via config.yaml (experimental):
models:
  - model_id: "raptor-fast"
    model_type: "lm"
    model_path: "srswti/bodega-raptor-8b-mxfp4"
    draft_model_path: "Qwen/Qwen3-0.6B-MLX-4bit"
    num_draft_tokens: 3

The response format is identical to a standard completion — no extra fields, no proprietary metrics. The only observable difference is that the payload arrives faster. The completion_tokens count reflects what the target model produced, not the draft speculation.
{
"id": "chatcmpl_2fa419e...",
"object": "chat.completion",
"model": "raptor-fast",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "Here's your answer...",
"role": "assistant"
}
}
],
"usage": {
"prompt_tokens": 21,
"total_tokens": 121,
"completion_tokens": 100,
"prompt_tokens_details": {
"cached_tokens": 3
}
}
}

Bodega's continuous batching engine maximizes throughput for multi-user workloads on Apple Silicon. It is the primary mechanism for serving multiple concurrent users efficiently, and the numbers are dramatic — small SRSWTI models and community models like mlx-community/Qwen3.5-2B-6bit approach ~900 tok/s system throughput on an M4 Max when measured in-process. At the HTTP server layer, measured throughput currently reaches ~600 tok/s — the gap is not the inference engine, it is the HTTP serialization layer, and we are actively working to close it. See the HTTP Bottleneck section below for details.
The Continuous Batching Flow:
- Request A arrives. The engine processes A's prompt and starts generating token 1.
- Request B arrives. Instead of waiting, the engine's Scheduler injects B into the active batch instantly.
- On the very next step, the GPU processes both A's token generation AND B's prompt processing simultaneously.
- The output is streamed back dynamically: token 2 for A, and token 1 for B.
- If Request A hits a stop word and finishes, it is ejected from the batch immediately, freeing up space for Request C, while Request B simply continues generating.
Why this is blazingly fast: Because Apple Silicon is bottlenecked by memory bandwidth during text generation, fetching the model weights accounts for roughly 80% of the time. If you can fetch the weights once and use them to multiply against four different requests simultaneously, you get nearly 4x the throughput with almost zero latency penalty.
This is called "continuous" because requests enter and exit the active GPU batch fluidly as they arrive and finish, without waiting for the whole batch to complete.
The difference is most visible in TTFT (time to first token) under concurrent load. In sequential mode, request 8 waits for requests 1–7 to finish — TTFT grows linearly with queue depth. In continuous batching, all requests are injected into the active batch and begin generating almost immediately.
Benchmarked on the blackbird-she-doesnt-refuse-21b model on an M1 Max (64GB):
| Concurrency | Sequential Mean TTFT | CB Mean TTFT | Sequential Throughput | CB Throughput |
|---|---|---|---|---|
| 4 | 6,510ms | 541ms | 44.4 tok/s | 37.7 tok/s |
| 8 | 12,837ms | 247ms | 44.1 tok/s | 49.2 tok/s |
At concurrency 8, continuous batching delivers a 52x improvement in TTFT — from 12.8 seconds to 247ms. Sequential throughput is flat because it's bottlenecked by single-request speed. CB throughput scales by saturating GPU parallelism across concurrent sequences.
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_type": "lm",
"continuous_batching": true,
"cb_max_num_seqs": 256,
"cb_prefill_batch_size": 16,
"cb_completion_batch_size": 32
}'

Or via config.yaml (experimental):
models:
  - model_id: "raptor-batched"
    model_type: "lm"
    model_path: "srswti/bodega-raptor-8b-mxfp4"
    continuous_batching: true
    cb_max_num_seqs: 256
    cb_prefill_batch_size: 16
    cb_completion_batch_size: 32

To tune the batching engine, you have five main levers:
cb_max_num_seqs — What it is: The absolute maximum number of sequences (requests) the engine is allowed to hold in its scheduler at one time. How to tune:

- If this is too low, requests will be rejected under heavy load.
- If it's too high, you might run out of KV-cache memory, causing MLX to swap to disk (very slow).
- Set this based on your available RAM. 256 is safe for M1/M2/M3 Max chips (64GB) with 8B models.

cb_completion_batch_size — What it is: The maximum number of sequences that can be actively generating tokens on the GPU simultaneously. How to tune:

- Above ~32 concurrent generations, you start hitting computation limits on Apple Silicon GPUs, and individual Time-To-First-Token (TTFT) or Time-Per-Output-Token (TPOT) will rise.
- 32 means MLX will multiply the weights against a matrix of size 32 on every generation step.

cb_prefill_batch_size — What it is: When a burst of 50 new requests arrives, how many of them are injected into the active batch on the very next step. How to tune:

- Prefilling (processing the initial prompt) is computationally heavy. If you try to prefill 50 prompts at once, the GPU hangs for several seconds. If there are other requests currently generating tokens, those users will experience a massive stutter.
- By capping this at 8, new requests are digested in small bites. The active generation stream might pause for 100ms instead of 3000ms.

cb_chunked_prefill_tokens — What it is: What if a single user submits a massive 16,000-token prompt? That alone will block the GPU. Chunked prefill solves this by splitting that 16K prompt into 2048-token chunks. How to tune:

- During step 1, it processes chunk 1 (0–2048) alongside the active token generations.
- Step 2: chunk 2 (2048–4096) + active generations.
- This entirely eliminates the "long prompt stutter" problem for concurrent users. Set to 0 to disable.

cb_enable_prefix_cache — What it is: Automatic prompt caching. If User A asks a question about a 10,000-token document, the engine calculates the KV-cache and stores it in memory blocks. If User B asks a different question about the exact same document, the engine recognizes the shared prefix and instantly reuses the 10,000-token cache, dropping TTFT from seconds to milliseconds. How to tune: Leave it on. It uses block-aware memory management to automatically evict the oldest prefixes when you hit MLX memory pressure.
The tuning parameters, summarized:

| Parameter | Recommended | What It Controls |
|---|---|---|
| `cb_max_num_seqs` | 256 | Total scheduler capacity — active + waiting sequences combined. Lower this to 64 on 16GB Macs with large models to prevent KV-cache overflow and disk swapping. |
| `cb_completion_batch_size` | 32 | Max concurrent token generations per GPU step. The primary throughput lever. Above ~32 on small models, Apple Silicon hits compute saturation and per-token speed degrades. For 21B+ models, cap at 16. |
| `cb_prefill_batch_size` | 8–16 | How many new prompt-ingestion requests are allowed to enter the active batch per step. This is your TTFT fairness lever. Higher values process bursts faster but can cause brief generation stutter for active streams during the prefill phase. |
| `cb_chunked_prefill_tokens` | 2048 | Splits very long prompts into chunks ingested across multiple steps. Prevents a single massive-context request from freezing generation for everyone else. |
| `cb_enable_prefix_cache` | true | Block-aware KV-cache. Recognizes shared prefixes across requests (identical system prompts, shared documents) and reuses computed KV blocks, eliminating re-ingestion entirely. |
blackbird-she-doesnt-refuse-21b (Hybrid SWA/Global attention)
| Concurrency | Wall Time | Throughput | Mean TTFT | P95 TTFT |
|---|---|---|---|---|
| 4 | 22.22s | 37.7 tok/s | 541ms | 1463ms |
| 8 | 13.96s | 49.2 tok/s | 247ms | 372ms |
| 16 | 16.71s | 63.7 tok/s | 1444ms | 2880ms |
Peak: 1.69x throughput gain. Gains plateau after concurrency 8 — this model is memory-bandwidth-bound at 21B. TTFT climbs at concurrency 16 as the prefill queue builds up.
deepseek-raptor-32b-4bit
| Concurrency | Wall Time | Throughput | Mean TTFT | P95 TTFT |
|---|---|---|---|---|
| 1 | 188.25s | 8.8 tok/s | 461ms | 1629ms |
| 4 | 161.37s | 9.3 tok/s | 773ms | 1356ms |
| 8 | 145.56s | 10.4 tok/s | 9,802ms | 36,049ms |
| 16 | 162.22s | 9.8 tok/s | 39,025ms | 93,014ms |
Peak: 1.18x gain, marginal. At 32B, this model is heavily compute-bound. Adding batch concurrency provides minimal throughput benefit while TTFT explodes. Recommended concurrency: 1–4.
bodega-raptor-0.9b
| Concurrency | Wall Time | Throughput | Mean TTFT | P95 TTFT |
|---|---|---|---|---|
| 4 | 46.26s | 57.3 tok/s | 234ms | 976ms |
| 8 | 44.41s | 58.5 tok/s | 241ms | 318ms |
| 16 | 32.30s | 80.7 tok/s | 440ms | 567ms |
| 32 | 20.57s | 127.5 tok/s | 923ms | 1455ms |
Peak: 2.23x throughput gain at concurrency 32. Strong scaling characteristic of sub-1B models where the GPU is purely bandwidth-bound.
mlx-community/Qwen3.5-2B-6bit
| Concurrency | Wall Time | Throughput | Mean TTFT | P95 TTFT |
|---|---|---|---|---|
| 4 | 12.96s | 313.3 tok/s | 204ms | 1239ms |
| 8 | 10.13s | 384.2 tok/s | 126ms | 255ms |
| 16 | 9.22s | 447.2 tok/s | 290ms | 458ms |
| 32 | 6.87s | 619.8 tok/s | 596ms | 617ms |
Peak: 1.98x gain at ~620 tok/s measured over HTTP. The ~900 tok/s figure was measured by calling the batching engine directly in-process — no HTTP server, no SSE serialization, no network stack. That number represents the raw inference ceiling on M4 Max. The ~280 tok/s gap you see in the HTTP benchmark is entirely the server layer, not the inference engine. See The HTTP Bottleneck below.
Detailed sweep — mlx-community/Qwen3.5-2B-6bit (mixed and same-query)
| Scenario | Concurrency | Prefill Batch | Mean TTFT | P95 TTFT | Per-Req TPS | System Throughput |
|---|---|---|---|---|---|---|
| Mixed | 8 | 2 | 196ms | 316ms | 58.7 | 450.5 tok/s |
| Mixed | 8 | 4 | 206ms | 289ms | 59.9 | 451.6 tok/s |
| Mixed | 8 | 8 | 258ms | 259ms | 62.4 | 462.6 tok/s |
| Mixed | 16 | 4 | 344ms | 557ms | 36.0 | 534.0 tok/s |
| Mixed | 16 | 8 | 321ms | 425ms | 36.6 | 536.0 tok/s |
| Mixed | 16 | 16 | 331ms | 332ms | 37.3 | 547.4 tok/s |
| Mixed | 32 | 8 | 384ms | 676ms | 31.2 | 889.5 tok/s |
| Mixed | 32 | 16 | 424ms | 616ms | 31.4 | 901.3 tok/s |
| Same query | 8 | 2 | 165ms | 278ms | 60.7 | 439.7 tok/s |
| Same query | 8 | 4 | 144ms | 198ms | 64.0 | 475.8 tok/s |
| Same query | 8 | 8 | 162ms | 162ms | 64.5 | 480.1 tok/s |
| Same query | 16 | 4 | 316ms | 543ms | 35.8 | 517.9 tok/s |
| Same query | 16 | 8 | 269ms | 365ms | 36.8 | 543.8 tok/s |
| Same query | 16 | 16 | 301ms | 302ms | 37.6 | 556.2 tok/s |
| Same query | 32 | 8 | 469ms | 773ms | 31.0 | 870.3 tok/s |
| Same query | 32 | 16 | 467ms | 638ms | 31.8 | 902.2 tok/s |
1. Throughput scales near-linearly with concurrency for small models. Without CB, system throughput equals per-request TPS (~60 tok/s). With CB at concurrency 32, you reach ~900 tok/s system throughput — a 15x total throughput gain on the same hardware.
2. Prefill batch size is a TTFT fairness lever, not a throughput lever. Notice P95 TTFT at concurrency 16: with prefill batch 4, P95 is 557ms — some users are waiting because they're stuck behind multiple prefill rounds. With prefill batch 16, P95 drops to 332ms and mean is 331ms — everyone in the burst gets their first token at nearly the same time. The rule: if you expect burst traffic (many requests arriving simultaneously), set a higher prefill batch. If requests arrive organically over time, a lower prefill batch keeps active generation streams smoother.
3. Prefix caching is a meaningful TTFT accelerator. At concurrency 8, mixed queries average 196–258ms TTFT. The same query (shared prefix, cache hit for all subsequent requests) drops to 144ms mean TTFT with a P95 of 198ms. The engine computed the prompt KV-cache once and reused it across all 8 requests. Per-request TPS also climbs from ~59 to ~64.5 because subsequent requests skip prompt ingestion entirely.
4. Large models (21B+) have a concurrency sweet spot. For the 32B model, optimal concurrency is 1–4. Pushing to 8+ concurrent requests causes TTFT to spike into the tens of seconds — the GPU is compute-saturated and the KV-cache grows large enough to risk swapping. For 21B models, concurrency 8 is the practical ceiling before TTFT becomes unacceptable for real-time users.
For small and mid-size models, the batching engine is fast enough that the HTTP server itself becomes the bottleneck — a situation that is uncommon in most inference systems and speaks to how aggressively the Bodega inference engine saturates Apple Silicon's memory bandwidth.
What's happening: When the batching engine generates tokens, it produces them in steps. Each step generates one token per active sequence simultaneously, then the output needs to be: serialized to JSON, wrapped in a Server-Sent Events data: frame, written to each open HTTP response stream, and flushed through the OS network stack. For large models generating at 8–30 tok/s, this overhead is negligible. For a model running at 900 tok/s in-process across 32 concurrent streams, each engine step completes in milliseconds — and the HTTP layer starts struggling to keep up with the token emission rate.
The measured gap on M4 Max with Qwen3.5-2B-6bit:
| Mode | Throughput |
|---|---|
| In-process (direct engine call, no HTTP) | ~900 tok/s |
| HTTP with streaming (`text/event-stream`) | ~600 tok/s |
| Gap | ~300 tok/s (~33% overhead) |
A note on the measured 600 tok/s figure: This was recorded on a live macOS system, not an isolated benchmark environment. Apple Silicon's unified memory architecture makes this more significant than it would be on a discrete GPU system. On a dedicated GPU, inference has its own VRAM and the CPU/system RAM is separate. On Apple Silicon, everything — the inference engine, WindowServer, your browser's GPU process, Electron renderers — shares the same memory bus and the same Metal command queue. So a busy Electron app isn't just using CPU, it's genuinely competing for the same memory bandwidth that the inference engine depends on. The true HTTP ceiling on a fully idle machine may be measurably higher than 600 tok/s. The in-process ~900 tok/s figure is a tighter measurement by comparison since it bypasses the HTTP layer entirely, but both numbers should be treated as real-world approximations rather than hardware ceilings.
The ~300 tok/s gap is not lost inference work — the GPU keeps generating tokens at close to the same rate. The overhead is a chain of costs that compound at high concurrency: JSON-serializing each token delta, wrapping it in an SSE data: frame, asyncio coroutine scheduling across 32 simultaneous response streams (where the GIL becomes a factor), and the TCP flush through the OS network stack. At very high concurrency this back-pressure can also slow engine step dispatch slightly, because the event loop is busy flushing and isn't ready for the next step — so the GPU is not entirely independent of the server layer. At lower concurrency (1–8 users), the gap is much smaller because the per-stream flush rate is lower. Per-token SSE flushing is genuinely expensive, and other inference servers (vLLM, llama.cpp server, TGI) have documented the same phenomenon.
Theoretical optimized ceiling
If we implement batched token emission (buffering 5–10ms of tokens before flushing rather than one flush per engine step), the estimated recovery is roughly 200–250 tok/s, bringing HTTP throughput to around ~820 tok/s at concurrency 32. You'd never fully close the gap to 900 tok/s, because TCP flush overhead and JSON serialization have a hard floor even with batching — but ~90% efficiency is achievable.
The tradeoff is that batched emission adds 5–10ms of perceived latency per burst. At high concurrency that's completely invisible. For single-user, latency-sensitive workloads it might be perceptible, which is exactly why speculative decoding (as recommended above) remains the right choice for that case.
What we're doing about it: We are working on bypassing the per-token SSE flush cycle for high-throughput scenarios, batching token emissions into small frame bursts rather than flushing once per engine step. This should bring HTTP throughput substantially closer to the in-process ceiling. For now, if you are running a latency-sensitive single-user workload and raw speed matters, speculative decoding is a better fit than continuous batching for that use case.
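For illustration, a minimal sketch of what buffered emission could look like — an assumed design using asyncio, not the shipped implementation: tokens are accumulated for a few milliseconds and flushed as a single SSE frame instead of one flush per engine step.

```python
import asyncio
import json

async def emit_stream(token_queue: asyncio.Queue, writer: asyncio.StreamWriter,
                      flush_interval: float = 0.008) -> None:
    """Accumulate generated tokens and flush one SSE frame per ~8 ms burst."""
    loop = asyncio.get_running_loop()
    done = False
    while not done:
        buffer = []
        deadline = loop.time() + flush_interval
        # Collect whatever tokens arrive before the deadline.
        while (remaining := deadline - loop.time()) > 0:
            try:
                token = await asyncio.wait_for(token_queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            if token is None:          # sentinel: generation finished
                done = True
                break
            buffer.append(token)

        if buffer:
            payload = {"choices": [{"delta": {"content": "".join(buffer)}}]}
            writer.write(f"data: {json.dumps(payload)}\n\n".encode())
            await writer.drain()       # one serialization + one TCP flush per burst
```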
Small models (90M–8B) on any Mac with 16GB+ RAM:
cb_max_num_seqs: 256, cb_completion_batch_size: 32, cb_prefill_batch_size: 16
Large models (14B–32B) on 32GB+ RAM:
cb_max_num_seqs: 64, cb_completion_batch_size: 16, cb_prefill_batch_size: 4–8
Override a model's default chat template:
curl -X POST http://localhost:44468/v1/admin/load-model \
-H "Content-Type: application/json" \
-d '{
"model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
"model_type": "lm",
"chat_template_file": "/path/to/custom_template.jinja"
}'

Endpoint: GET /health
curl http://localhost:44468/health

Healthy (multi-model):
{
"status": "ok",
"model_id": "bodega-solomon-9b, bodega-raptor-8b",
"model_status": "initialized (2 model(s))",
"models_detail": [
{"id": "bodega-solomon-9b", "type": "multimodal", "status": "running", "ram_usage_mb": 11645.8},
{"id": "bodega-raptor-8b", "type": "lm", "status": "running", "ram_usage_mb": 4558.4}
]
}

No models loaded:
{
"status": "unhealthy",
"model_id": null,
"model_status": "no_models"
}

Endpoint: GET /v1/queue/stats
curl http://localhost:44468/v1/queue/stats

response = requests.get("http://localhost:44468/v1/queue/stats")
stats = response.json()["queue_stats"]
print(f"Queue size: {stats.get('queue_size', 0)}")
print(f"Active requests: {stats.get('active_requests', 0)}")Our Open source Work
- Explore our Models: Hugging Face
- Coding CLI: axe on GitHub
Fastest (edge/laptop):
- srswti/bodega-orion-0.6b — Sub-100M params, exceptional tool calling and reasoning at the edge (https://huggingface.co/srswti)
- SRSWTI/bodega-raptor-0.9b — 400+ tok/s, ideal for classification and query reformulation
- SRSWTI/axe-turbo-1b — Sub-50ms first token, edge-first agentic coding
Balanced performance:
- SRSWTI/bodega-raptor-1b-reasoning-opus4.5-distill — Distilled from Claude Opus 4.5 reasoning patterns
- SRSWTI/bodega-vertex-4b — Optimized for structured data processing
- SRSWTI/bodega-raptor-8b-mxfp4 — Best general-purpose choice for laptops
Multimodal and agentic:
- SRSWTI/bodega-solomon-9b — Vision + best-in-class agentic coding workflows
High capacity:
- SRSWTI/bodega-raptor-15b-6bit — Enhanced Raptor variant
- SRSWTI/bodega-centenario-21b-mxfp4 — Production workhorse, 21B params optimized for sustained workloads
- SRSWTI/blackbird-she-doesnt-refuse-21b — Uncensored 21B for unrestricted generation
- SRSWTI/axe-turbo-31b — High-capacity desktop/server variant with agentic coding focus
Flagship intelligence:
- SRSWTI/deepseek-v3.2-speciale-distilled-raptor-32b-4bit — DeepSeek V3.2 distilled to 32B with Raptor reasoning. Exceptional math and code generation in a 5–7GB footprint. 120 tok/s on M4 Max.
- Use the smallest context length that fits your use case
- Unload models you're not actively using to free unified memory
- Monitor queue stats to avoid overloading the scheduler
- Prefer quantized (4-bit or 8-bit) models for better memory efficiency
- Set max_concurrency: 1 for single-user scenarios
- Use streaming for long responses to improve perceived latency
- Enable prompt_cache_size for workloads with recurring prefixes
- Use speculative decoding for single-user, latency-sensitive workloads
- Use continuous batching for multi-user, throughput-sensitive workloads
response = requests.post("http://localhost:44468/v1/chat/completions", json={...})
if response.status_code == 503:
print("No model loaded. Load a model first.")
elif response.status_code == 400:
print("Invalid request parameters.")
elif response.status_code == 200:
result = response.json()
else:
print(f"Error: {response.status_code}")Bodega includes a fully self-contained RAG pipeline for PDF documents
Endpoint: POST /v1/rag/upload
curl -X POST http://localhost:44468/v1/rag/upload \
-F "file=@/path/to/your/document.pdf"Response:
{
"file_id": "rag-c6cd8f10",
"filename": "document.pdf",
"num_chunks": 71,
"status": "indexed"
}

The engine embeds your question, retrieves the most relevant chunks via FAISS cosine-similarity, and passes the context alongside your query to the active chat model.
Endpoint: POST /v1/rag/query
curl -X POST http://localhost:44468/v1/rag/query \
-H "Content-Type: application/json" \
-d '{
"file_id": "rag-c6cd8f10",
"query": "What is the main conclusion of this document?",
"model": "bodega-raptor-8b",
"top_k": 5
}'Add "stream": true to receive the answer as a Server-Sent Events stream, identical to the standard /v1/chat/completions endpoint.
Endpoint: GET /v1/rag/documents
curl http://localhost:44468/v1/rag/documents

Endpoint: DELETE /v1/rag/documents/{file_id}
curl -X DELETE http://localhost:44468/v1/rag/documents/rag-c6cd8f10

- The server runs on localhost:44468 only and is not accessible from external networks
- No authentication is required for local access
- Do not expose this port to the internet without adding proper security measures
- Only set trust_remote_code: true for models from verified sources
Documentation last updated: March 2026


