Bodega Inference Engine

What we recommend

The easiest way to get started is our interactive setup script. It configures a terminal-based monitoring tool, downloads your first model, and lets you test the engine's capabilities via benchmarks or the interactive chat shell.

# Make the script executable
chmod +x setup.sh

# Run the interactive setup (the recommended way to get started)
./setup.sh

Benchmarks & Leaderboard

Run the Runtime Comparison (LM Studio vs Bodega Inference Engine) with leaderboard upload:

# Ensure the same model is loaded in LM Studio with Max Concurrent Predictions = 32
python compare_engines.py --model srswti/bodega-orion-0.6b \
    --lmstudio-model-id bodega-orion-0.6b \
    --output results/compare_$(date +%Y%m%d_%H%M%S).json \
    --leaderboard-url https://leaderboard.srswti.com

Use --lmstudio-model-id to match the model ID shown in LM Studio (often the short name, e.g. bodega-orion-0.6b). Results are posted to the global leaderboard.

Bodega Inference Engine

Bodega Inference Engine delivers enterprise-grade inference directly on your machine. Built specifically for Apple Silicon, it provides a seamless runtime with an OpenAI-compatible API, designed to be faster, more memory-efficient, and more intuitive than other local runtimes.

Architecture: Multi-process isolated handler architecture prevents Metal memory leaks.

As of the latest release, Bodega is a multi-model registry — you can load, route to, and unload multiple models simultaneously, each running in its own hardware-isolated subprocess. The engine automatically handles resource allocation and delivers the fastest possible inference on Apple Silicon.

Key Capabilities:

Multi-model registry with dynamic loading and unloading
Language model inference with streaming support
Multimodal language model support (vision)
Image generation (live next week, week of March 17)
Image editing (live next week, week of March 17)
Structured output via JSON schema constraints
Speculative decoding for accelerated generation
Continuous batching for high-throughput workloads
Built-in prompt caching

Getting Started

Quick Implement

Start the server and load your first model:

# Load a model dynamically via the API (assumes the server is already running on port 44468)
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_id": "bodega-raptor-8b",
    "model_type": "lm",
    "context_length": 32768,
    "prompt_cache_size": 10
  }'

# Make your first inference request
curl -X POST http://localhost:44468/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bodega-raptor-8b",
    "messages": [
      {"role": "user", "content": "Hello, welcome to the world of dreamers?"}
    ]
  }'

Quick Start (Multi-Model Registry)

The standard way to run multiple models is to keep calling the /v1/admin/load-model endpoint — each call spawns a new isolated subprocess for that model. You can check which models are running and their memory usage at any time via /health:

# Load a language model
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "bodega-orion-0.6b",
    "model_type": "lm",
    "model_path": "srswti/bodega-orion-0.6b"
  }'

# Load a multimodal model alongside it
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "srswti/bodega-solomon-9b",
    "model_type": "multimodal",
    "model_path": "srswti/bodega-solomon-9b"
  }'

# Load our favourite model alongside it :)
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "blackbird",
    "model_type": "lm",
    "model_path": "srswti/blackbird-she-doesnt-refuse-21b",
    "context_length": 32768,
    "max_concurrency": 1,
    "reasoning_parser": "harmony",
    "tool_call_parser": "harmony"
  }'

# Check what's running
curl http://localhost:44468/health

# Example response (model IDs and memory figures depend on which models you have loaded)

{
  "status": "ok",
  "model_id": "bodega-raptor-0.9b, srswti/bodega-solomon-9b",
  "model_status": "initialized (2 model(s))",
  "models_detail": [
    {
      "id": "bodega-raptor-0.9b",
      "type": "lm",
      "status": "running",
      "ram_usage_mb": 3667.1
    },
    {
      "id": "srswti/bodega-solomon-9b",
      "type": "multimodal",
      "status": "running",
      "ram_usage_mb": 13344.5
    }
  ]
}
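
The /health payload shown above is easy to post-process. The following is a minimal sketch, assuming the response carries exactly the fields shown in the sample (models_detail, id, ram_usage_mb); the helper name summarize_health is hypothetical and not part of Bodega.

```python
def summarize_health(payload: dict) -> dict:
    """Collect per-model RAM usage and the total from a /health-style payload."""
    per_model = {m["id"]: m["ram_usage_mb"] for m in payload.get("models_detail", [])}
    return {"models": per_model, "total_ram_mb": round(sum(per_model.values()), 1)}


# Sample payload copied from the /health response shown above.
sample = {
    "status": "ok",
    "models_detail": [
        {"id": "bodega-raptor-0.9b", "type": "lm", "status": "running", "ram_usage_mb": 3667.1},
        {"id": "srswti/bodega-solomon-9b", "type": "multimodal", "status": "running", "ram_usage_mb": 13344.5},
    ],
}

summary = summarize_health(sample)
for model_id, ram in summary["models"].items():
    print(f"{model_id}: {ram:.1f} MB")
print(f"total: {summary['total_ram_mb']:.1f} MB")
```

Against a live server you would pass the parsed JSON from GET /health instead of the hard-coded sample.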

You can also load any HuggingFace model directly — not just SRSWTI models. For example, loading a community Qwen model with continuous batching:

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "Qwen/Qwen3-30B-A3B-MLX-4bit",
    "model_type": "lm",
    "model_path": "Qwen/Qwen3-30B-A3B-MLX-4bit",
    "max_concurrency": 1,
    "queue_timeout": 300,
    "queue_size": 100,
    "continuous_batching": true,
    "cb_max_num_seqs": 256,
    "cb_prefill_batch_size": 16,
    "cb_completion_batch_size": 32
  }' | python3 -m json.tool

{
  "status": "loaded",
  "model_id": "Qwen/Qwen3-30B-A3B-MLX-4bit",
  "model_path": "Qwen/Qwen3-30B-A3B-MLX-4bit",
  "model_type": "lm"
}

Note: config.yaml support for launching multiple models at server start is currently in experimental release for a limited set of users. General availability coming soon.

Example config.yaml:

server:
  host: "0.0.0.0"
  port: 44468

models:
  - model_id: "bodega-solomon-9b"
    model_type: "multimodal"
    model_path: "srswti/bodega-solomon-9b"
    max_concurrency: 1

  - model_id: "bodega-raptor-8b"
    model_type: "lm"
    model_path: "srswti/bodega-raptor-8b-mxfp4"
    prompt_cache_size: 10

Python Quick Start

import requests

BASE_URL = "http://localhost:44468"

# Load a model
response = requests.post(
    f"{BASE_URL}/v1/admin/load-model",
    json={
        "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
        "model_id": "bodega-raptor-8b",  # alias used by the chat request below
        "model_type": "lm",
        "context_length": 32768
    }
)
print(response.json())

# Chat completion
response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [
            {"role": "user", "content": "Explain quantum computing in simple terms."}
        ],
        "max_tokens": 500,
        "temperature": 0.7
    }
)
print(response.json()["choices"][0]["message"]["content"])

Core Endpoints

Chat Completions

Generate text responses using loaded language models. Fully compatible with OpenAI's chat completions API.

Endpoint: POST /v1/chat/completions

Basic Request

curl -X POST http://localhost:44468/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bodega-raptor-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'

Streaming Response

curl -X POST http://localhost:44468/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bodega-raptor-8b",
    "messages": [
      {"role": "user", "content": "Write a short story about AI."}
    ],
    "stream": true
  }'

Python Streaming Example

import requests
import json

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [
            {"role": "user", "content": "Write a short story about AI."}
        ],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data = line[6:]
            if data != '[DONE]':
                chunk = json.loads(data)
                content = chunk["choices"][0]["delta"].get("content", "")
                if content:
                    print(content, end="", flush=True)

Request Parameters

Parameter	Type	Default	Description
`model`	string	required	Model identifier
`messages`	array	required	Array of message objects with role and content
`max_tokens`	integer	null	Maximum tokens to generate
`temperature`	float	0.7	Sampling temperature (0.0 to 2.0)
`top_p`	float	1.0	Nucleus sampling parameter
`stream`	boolean	false	Enable streaming responses
`tools`	array	null	Available tools for function calling
`tool_choice`	string/object	"auto"	Control tool selection behavior
`response_format`	object	null	Specify output format (e.g., JSON schema)
`presence_penalty`	float	0.0	Penalize new tokens based on presence
`frequency_penalty`	float	0.0	Penalize new tokens based on frequency
`stop`	string/array	null	Stop sequences
`seed`	integer	null	Random seed for reproducibility

Response Format

{
  "id": "chatcmpl_1234567890",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "bodega-raptor-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 120,
    "total_tokens": 135
  }
}

Structured Outputs (JSON Schema)

Force the model to output data that strictly adheres to a predefined JSON schema. Constraints are applied natively within the inference engine using outlines.

Endpoint: POST /v1/chat/completions

import requests

schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "AddressExtractor",
        "schema": {
            "type": "object",
            "properties": {
                "address": {
                    "type": "object",
                    "properties": {
                        "street": {"type": "string"},
                        "city": {"type": "string"},
                        "state": {"type": "string", "description": "2 letter abbreviation"},
                        "zip": {"type": "string", "description": "5 digit zip code"}
                    },
                    "required": ["street", "city", "state", "zip"]
                }
            },
            "required": ["address"]
        }
    }
}

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [
            {"role": "system", "content": "Extract the address from the user input into the specified JSON format."},
            {"role": "user", "content": "Please format this address: 1 Hacker Wy Menlo Park CA 94025"}
        ],
        "response_format": schema,
        "stream": False
    }
)

# Example output (exact values depend on the model):
# '{"address": {"street": "1 Hacker Wy", "city": "Menlo Park", "state": "CA", "zip": "94025"}}'
print(response.json()["choices"][0]["message"]["content"])

Structured output also works with "stream": true — the model will stream partial JSON tokens as they are generated.
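
Even with schema-constrained decoding, it can be useful to sanity-check the returned JSON on the client. The sketch below is a deliberately small checker that only looks at "required" keys of nested objects. It is not a full JSON Schema validator, and the helper name missing_required is hypothetical.

```python
import json


def missing_required(value, schema: dict, path: str = "") -> list:
    """Return dotted paths of required keys that are absent (object schemas only)."""
    missing = []
    if schema.get("type") != "object" or not isinstance(value, dict):
        return missing
    for key in schema.get("required", []):
        if key not in value:
            missing.append(f"{path}{key}")
    for key, sub_schema in schema.get("properties", {}).items():
        if key in value:
            missing.extend(missing_required(value[key], sub_schema, f"{path}{key}."))
    return missing


# Inner schema from the AddressExtractor example above.
address_schema = {
    "type": "object",
    "properties": {
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
                "zip": {"type": "string"},
            },
            "required": ["street", "city", "state", "zip"],
        }
    },
    "required": ["address"],
}

content = '{"address": {"street": "1 Hacker Wy", "city": "Menlo Park", "state": "CA", "zip": "94025"}}'
print(missing_required(json.loads(content), address_schema))
```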

Multimodal Completions (Vision)

Pass images alongside text prompts for models with vision capabilities such as bodega-solomon-9b.

Endpoint: POST /v1/chat/completions

URL Image

curl -X POST http://localhost:44468/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ 
    "model": "srswti/bodega-solomon-9b", 
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image? Provide a detailed description."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://weblog.spots.ag/08-2018/aventador_svj_official/th.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

Local Base64 Image

import base64
import requests

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("document_scan.png")

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
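    json={
        # Use the model_id the model was loaded under (the Quick Start loads it as "srswti/bodega-solomon-9b")
        "model": "srswti/bodega-solomon-9b",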
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract all text from this scanned document."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                    }
                ]
            }
        ]
    }
)
print(response.json()["choices"][0]["message"]["content"])
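
The example above hard-codes image/png in the data URL. The sketch below infers the MIME type from the file name using the standard library. The helper names to_data_url and image_part are hypothetical, and whether the server accepts formats other than PNG is an assumption you should verify.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def image_part(image_bytes: bytes, filename: str) -> dict:
    """Build an image_url content block in the format used above."""
    return {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, filename)}}


part = image_part(b"abc", "document_scan.png")  # placeholder bytes, not a real image
print(part["image_url"]["url"])
```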

Image Generation

Generate images from text prompts using locally-running image models.

Coming the week of March 17. It is currently an experimental release.

Endpoint: POST /v1/images/generations

First, load an image generation model using one of the available config_name values:

# Solomon — fast, lightweight generation (recommended starting point)
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "solomon",
    "model_type": "image-generation",
    "config_name": "solomon"
  }'

# Keshav — turbo generation, extremely fast
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "keshav",
    "model_type": "image-generation",
    "config_name": "keshav"
  }'

Then generate:

curl -X POST http://localhost:44468/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "solomon",
    "prompt": "A highly detailed portrait of a tiny red dragon wearing a chef hat, pulling a fresh loaf of sourdough bread out of a medieval stone oven.",
    "size": "1024x1024",
    "guidance_scale": 3.5,
    "steps": 14,
    "seed": 42
  }'

Available config_name values for image generation: solomon, solomon-max, rehoboam, omri-4b, omri-9b, keshav, kalamkari, fibo.

The engine returns a standard OpenAI-compatible object with a b64_json image payload:

{
  "created": 1709428581,
  "data": [
    {
      "b64_json": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
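
To turn the b64_json payload into image data, decode it with the standard library. This is a minimal sketch that assumes the response has the shape shown above; the helper names are hypothetical. The sample payload begins with the base64 form of the PNG signature, so the sketch writes .png files.

```python
import base64


def decode_images(response: dict) -> list:
    """Decode every b64_json entry in an images response into raw bytes."""
    return [base64.b64decode(item["b64_json"]) for item in response.get("data", [])]


def save_images(response: dict, prefix: str = "output") -> list:
    """Write each decoded image to <prefix>_<index>.png and return the paths."""
    paths = []
    for index, raw in enumerate(decode_images(response)):
        path = f"{prefix}_{index}.png"
        with open(path, "wb") as handle:
            handle.write(raw)
        paths.append(path)
    return paths


# Stand-in response containing only the 8-byte PNG signature, for illustration.
fake_response = {"created": 0, "data": [{"b64_json": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode()}]}
print(decode_images(fake_response)[0][:4])
```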

Image Editing

Edit existing images with text instructions using srswti/keshav or srswti/kalamkari.

Coming week of March 17.

Endpoint: POST /v1/images/edits

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "kalamkari-edit",
    "model_type": "image-edit",
    "config_name": "qwen-image-edit"
  }'

Model Management

Bodega is a multi-model registry. You can dynamically spawn, route to, and unload process-isolated model handlers without ever restarting the server.

Load Model

Spawn a new handler process for a model. It becomes immediately available for inference requests.

Endpoint: POST /v1/admin/load-model

Load a Language Model

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_id": "bodega-raptor-8b",
    "model_type": "lm",
    "context_length": 32768,
    "max_concurrency": 1,
    "prompt_cache_size": 10
  }'

Load an Image Generation Model

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "srswti/solomon",
    "model_type": "image-generation",
    "config_name": "solomon",
    "quantize": 8
  }'

Python Example

import requests

response = requests.post(
    "http://localhost:44468/v1/admin/load-model",
    json={
        "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
        "model_type": "lm",
        "context_length": 32768,
        "max_concurrency": 1,
        "reasoning_parser": "qwen3",
        "tool_call_parser": "qwen3"
    }
)
print(response.json())

Mapping HuggingFace Model Types to `model_type`

When loading any model from HuggingFace — not just SRSWTI models — use the HuggingFace model card to determine the right model_type. The two most common cases:

text-generation on HuggingFace → model_type: "lm"

These are standard language models that take text in and produce text out. Any model whose HuggingFace page lists the pipeline tag as text-generation should be loaded with "lm".

# Example: a community Qwen text generation model
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "qwen3-8b",
    "model_type": "lm",
    "model_path": "mlx-community/Qwen3-8B-4bit",
    "context_length": 32768
  }'

image-text-to-text on HuggingFace → model_type: "multimodal"

These are vision-language models that accept both images and text as input. Any model whose HuggingFace page lists the pipeline tag as image-text-to-text should be loaded with "multimodal". This applies to models like Qwen-VL, LLaVA, InternVL, and others.

# Example: a community vision model
curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "qwen3.5-27b-vl",
    "model_type": "multimodal",
    "model_path": "mlx-community/Qwen3.5-27B-4bit",
    "context_length": 16384
  }'

Once loaded as multimodal, you can pass images in the standard image_url content block format just like with bodega-solomon-9b.
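
The two mappings above can be captured in a small lookup. This is only a sketch: the function name load_request_for is hypothetical, and pipeline tags other than the two documented here are rejected rather than guessed.

```python
PIPELINE_TAG_TO_MODEL_TYPE = {
    "text-generation": "lm",
    "image-text-to-text": "multimodal",
}


def load_request_for(model_path: str, pipeline_tag: str, **extra) -> dict:
    """Build a /v1/admin/load-model body from a HuggingFace pipeline tag."""
    try:
        model_type = PIPELINE_TAG_TO_MODEL_TYPE[pipeline_tag]
    except KeyError:
        raise ValueError(f"no model_type mapping documented for {pipeline_tag!r}") from None
    return {"model_path": model_path, "model_type": model_type, **extra}


request_body = load_request_for(
    "mlx-community/Qwen3-8B-4bit", "text-generation", model_id="qwen3-8b", context_length=32768
)
print(request_body)
```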

Load Model Parameters

Parameter	Type	Default	Description
`model_path`	string	required	HuggingFace repo ID or local path
`model_id`	string	null	Alias used in API requests (defaults to path)
`model_type`	string	"lm"	Model type: `lm`, `multimodal`, `image-generation`, `image-edit`
`context_length`	integer	32768	Maximum context length
`max_concurrency`	integer	1	Maximum concurrent requests
`queue_timeout`	integer	300	Request timeout in seconds
`queue_size`	integer	100	Maximum queue size
`quantize`	integer	8	Quantization level for Flux models (4, 8, or 16)
`config_name`	string	null	Config for image generation: `solomon`, `solomon-max`, `rehoboam`, `omri-4b`, `omri-9b`, `keshav`, `kalamkari`, `fibo`. For editing: `flux-kontext-dev`, `flux2-klein-edit-4b`, `flux2-klein-edit-9b`, `qwen-image-edit`
`lora_paths`	array	null	Paths to LoRA adapters
`lora_scales`	array	null	Scale factors for LoRA adapters
`disable_auto_resize`	boolean	false	Disable auto-resize for vision models
`enable_auto_tool_choice`	boolean	false	Enable automatic tool selection
`tool_call_parser`	string	null	Parser for tool calls (`qwen3`, `harmony`, etc.)
`reasoning_parser`	string	null	Parser for reasoning content (`qwen3`, `harmony`, etc.)
`trust_remote_code`	boolean	false	Allow custom model code execution
`chat_template_file`	string	null	Path to custom chat template
`continuous_batching`	boolean	false	Enable high-throughput continuous batching
`cb_max_num_seqs`	integer	256	Max sequences in the batching engine
`cb_prefill_batch_size`	integer	8	Concurrency limit for prompt ingestion
`cb_completion_batch_size`	integer	32	Generation concurrency limit on GPU
`cb_chunked_prefill_tokens`	integer	2048	Token chunk size for large prompts
`cb_enable_prefix_cache`	boolean	true	Enable block-aware prompt caching
`draft_model_path`	string	null	Path to draft model for speculative decoding
`num_draft_tokens`	integer	null	Number of tokens for the draft model to guess
`prompt_cache_size`	integer	0	Number of prompt cache slots

Available Parsers

Both tool_call_parser and reasoning_parser support: qwen3, glm4_moe, qwen3_coder, qwen3_moe, qwen3_next, qwen3_vl, harmony, minimax_m2.

Unload Model

Gracefully shut down a model's subprocess and unregister it from the engine, instantly freeing its unified GPU/CPU memory. The rest of your loaded models continue running uninterrupted.

Endpoint: DELETE /v1/admin/unload-model/{model_id}

# Unload by model_id
curl -X DELETE http://localhost:44468/v1/admin/unload-model/bodega-raptor-0.9b

# Works with full path model IDs too
curl -X DELETE http://localhost:44468/v1/admin/unload-model/srswti/bodega-orion-0.6b

# Python equivalent
import requests

response = requests.delete("http://localhost:44468/v1/admin/unload-model/bodega-raptor-0.9b")
print(response.json())

Delete Model

Remove a model from your local HuggingFace cache to free disk space. The model must be unloaded first if it is currently running.

Endpoint: DELETE /v1/models/local/{model_id}

# Delete a locally cached model
curl -X DELETE "http://localhost:44468/v1/models/local/mlx-community/Qwen3.5-27B-4bit"

{"id": "mlx-community/Qwen3.5-27B-4bit", "object": "model", "deleted": true}

# Python equivalent
import requests

model_id = "SRSWTI/bodega-raptor-8b-mxfp4"
response = requests.delete(f"http://localhost:44468/v1/models/local/{model_id}")
print(response.json())

List Loaded Models & Memory Usage

Retrieve real-time Metal Unified Memory and CPU RSS metrics for all running models.

Endpoint: GET /v1/admin/loaded-models

curl http://localhost:44468/v1/admin/loaded-models

import requests

response = requests.get("http://localhost:44468/v1/admin/loaded-models")
models = response.json().get("data", [])

for model in models:
    print(f"[{model['status'].upper()}] {model['id']} — PID: {model['pid']}")
    mem = model.get('memory', {})
    print(f"  └ Metal Active (GPU): {mem.get('metal_active_mb', 0):.1f} MB")
    print(f"  └ Process RSS overhead (CPU): {mem.get('rss_mb', 0):.1f} MB")
    print(f"  └ Total System Pool: {mem.get('total_mb', 0):.1f} MB\n")

Response Format:

{
  "object": "list",
  "data": [
    {
      "id": "bodega-raptor-8b",
      "type": "lm",
      "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
      "context_length": 32768,
      "created_at": 1704067200,
      "status": "running",
      "pid": 83932,
      "memory": {
        "metal_active_mb": 4150.2,
        "metal_cache_mb": 0.0,
        "metal_peak_mb": 4150.2,
        "rss_mb": 408.2,
        "total_mb": 4558.4
      }
    }
  ],
  "total": 1
}

Model Discovery

Discover, download, and manage models from HuggingFace.

List Available Models

List all models in your local HuggingFace cache.

Endpoint: GET /v1/models

curl http://localhost:44468/v1/models

# Verify download completeness against HuggingFace API
curl "http://localhost:44468/v1/models?verify_with_hub=true"

Response Format:

{
  "object": "list",
  "data": [
    {
      "id": "SRSWTI/bodega-raptor-8b-mxfp4",
      "object": "model",
      "created": 1704067200,
      "owned_by": "SRSWTI",
      "size_gb": 4.8,
      "download_percentage": 100.0,
      "is_complete": true
    }
  ]
}

Download Model

Download a model to your local cache.

Endpoint: POST /v1/admin/download-model

curl -X POST http://localhost:44468/v1/admin/download-model \
  -H "Content-Type: application/json" \
  -d '{"model_path": "SRSWTI/bodega-raptor-8b-mxfp4"}'

Download Model with Progress

Download with real-time progress via Server-Sent Events.

Endpoint: POST /v1/admin/download-model-stream

import json
import requests

response = requests.post(
    "http://localhost:44468/v1/admin/download-model-stream",
    json={"model_path": "SRSWTI/bodega-raptor-8b-mxfp4"},
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: ') and line[6:] != '[DONE]':
            progress = json.loads(line[6:])
            print(f"{progress['status']} — {progress.get('progress', 0)}%")
            if 'current_file' in progress:
                print(f"  File: {progress['current_file']}")

Advanced Features

Reasoning Models

Some models support an explicit reasoning/thinking process. Configure a parser to extract it.

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_id": "bodega-raptor-8b",
    "model_type": "lm",
    "reasoning_parser": "qwen3"
  }'

import requests

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [{"role": "user", "content": "Solve this logic puzzle: ..."}],
        "chat_template_kwargs": {"enable_thinking": True}
    }
)

message = response.json()["choices"][0]["message"]

if "reasoning_content" in message:
    print("Thinking:", message["reasoning_content"])

print("Answer:", message["content"])

JSON Mode

Force the model to output valid JSON.

import json
import requests

response = requests.post(
    "http://localhost:44468/v1/chat/completions",
    json={
        "model": "bodega-raptor-8b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that outputs JSON."},
            {"role": "user", "content": "List three colors with their hex codes."}
        ],
        "response_format": {"type": "json_object"}
    }
)

# The message content is a JSON string; parse it before use
result = json.loads(response.json()["choices"][0]["message"]["content"])
print(result)

Prompt Caching

Bodega uses dynamic prompt caching for extremely fast time-to-first-token on recurring sequences. The cache operates natively on MLX token indices — overlapping prefixes across subsequent calls bypass matrix multiplication completely.

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_type": "lm",
    "prompt_cache_size": 25
  }'

Speculative Decoding

Speculative decoding significantly accelerates generation for large models — especially in single-user, latency-sensitive workloads — without any change to output quality or the response format you receive.

Why generation is slow on large models

On Apple Silicon, text generation is memory-bandwidth-bound, not compute-bound. For every token a large model generates, the GPU must load the full set of model weights from unified memory into the compute cores. An 8B-parameter model at 4-bit quantization is roughly 4–5GB. Because all of those weights are read to produce a single token, the vast majority of each generation step is spent on memory transfer, not math. This is why adding GPU cores doesn't help much: you're waiting on the memory bus, not the ALUs.
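
A back-of-the-envelope calculation makes the bandwidth limit concrete. The sketch below assumes a peak unified-memory bandwidth of 400 GB/s (Apple's published figure for the M1 Max, used here purely as an illustrative assumption) and ignores KV-cache reads and other overhead, so real throughput will be lower than this ceiling.

```python
def max_tokens_per_second(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ceiling on single-stream decode speed if every token reads all weights once."""
    return bandwidth_gb_per_s / weights_gb


ASSUMED_BANDWIDTH_GB_PER_S = 400.0  # assumption: M1 Max peak memory bandwidth

for weights_gb in (4.0, 5.0):  # the 4-5GB range quoted above for an 8B model at 4-bit
    ceiling = max_tokens_per_second(weights_gb, ASSUMED_BANDWIDTH_GB_PER_S)
    print(f"{weights_gb:.0f} GB of weights: at most {ceiling:.0f} tok/s")
```

The ceiling depends only on bytes moved per token, which is why the techniques below focus on getting more tokens out of each pass over the weights.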

What speculative decoding does

Instead of running the large target model once per token, the engine pairs two models:

Draft model — a small, fast model (e.g. 0.6B params) that guesses the next N tokens very quickly. Because it's tiny, this costs almost nothing.
Target model — the large model you actually want responses from. Instead of generating one token at a time, it evaluates all N draft guesses in a single forward pass using parallel matrix multiplication.

If the target model agrees with the draft's guesses, all N tokens are accepted at once. You get N tokens for the memory-load cost of one. When the target disagrees at position k, it accepts tokens 0 through k-1 and corrects at k, and the draft restarts from there.

In practice, a well-matched draft model (same tokenizer family, same training distribution) agrees on the majority of guesses, yielding effective speedups of 2–3x on generation-heavy workloads without touching output quality. The output is mathematically identical to what the target model would have generated on its own.
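
The accept-and-correct loop described above can be sketched with toy models. This is not Bodega's implementation: the two "models" are plain deterministic functions standing in for greedy decoding, the verification loop stands in for what a real engine does in a single batched forward pass, and all names are hypothetical.

```python
def target(context: list) -> int:
    """Toy target model: a deterministic next-token function."""
    return (sum(context) * 31 + len(context)) % 50


def draft(context: list) -> int:
    """Toy draft model: agrees with the target except at every 4th position."""
    guess = target(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_generate(model, prompt: list, max_new_tokens: int) -> list:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_generate(prompt: list, max_new_tokens: int, num_draft_tokens: int):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft proposes N tokens autoregressively.
        proposal = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position (one pass in a real engine).
        target_passes += 1
        accepted = []
        for i in range(num_draft_tokens):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # the target's token is what gets emitted
            if expected != proposal[i]:
                break  # first disagreement: keep the correction, discard the rest
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


baseline = greedy_generate(target, [1, 2, 3], 12)
speculative, passes = speculative_generate([1, 2, 3], 12, num_draft_tokens=4)
print(speculative == baseline, passes)
```

Every emitted token comes from the target, so the output matches plain greedy decoding while the target runs far fewer passes than tokens generated.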

Requirements

The draft model must share the same tokenizer as the target model. Using a model from a different family (e.g. a Llama draft with a Qwen target) will produce garbage. Use a smaller variant from the same model family, for example a 0.6B or 1B Qwen3 variant to accelerate an 8B or 32B Qwen3 target.

Note: Speculative decoding and continuous batching cannot be used simultaneously. Speculative decoding is optimal for single-user latency; continuous batching is optimal for multi-user throughput under concurrent load. Choose based on your workload.
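
Since the two features are mutually exclusive, it can help to check a load-model body before sending it. The sketch below is a client-side guard only; the function name inference_mode and the mode labels it returns are hypothetical, and how the server itself reacts to a conflicting configuration is not specified here.

```python
def inference_mode(config: dict) -> str:
    """Classify a load-model body and reject the unsupported combination."""
    speculative = bool(config.get("draft_model_path"))
    batching = bool(config.get("continuous_batching"))
    if speculative and batching:
        raise ValueError(
            "draft_model_path and continuous_batching cannot be combined; choose one per model"
        )
    if speculative:
        return "speculative"
    return "continuous_batching" if batching else "sequential"


print(inference_mode({
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "draft_model_path": "Qwen/Qwen3-0.6B-MLX-4bit",
    "num_draft_tokens": 4,
}))
```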

Configuration

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_type": "lm",
    "draft_model_path": "Qwen/Qwen3-0.6B-MLX-4bit",
    "num_draft_tokens": 4
  }'

Or via config.yaml (experimental):

models:
  - model_id: "raptor-fast"
    model_type: "lm"
    model_path: "srswti/bodega-raptor-8b-mxfp4"
    draft_model_path: "Qwen/Qwen3-0.6B-MLX-4bit"
    num_draft_tokens: 3

Response

The response format is identical to a standard completion — no extra fields, no proprietary metrics. The only observable difference is that the payload arrives faster. The completion_tokens count reflects what the target model produced, not the draft speculation.

{
  "id": "chatcmpl_2fa419e...",
  "object": "chat.completion",
  "model": "raptor-fast",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Here's your answer...",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 121,
    "completion_tokens": 100,
    "prompt_tokens_details": {
      "cached_tokens": 3
    }
  }
}

Continuous Batching (High Throughput)

Bodega's continuous batching engine maximizes throughput for multi-user workloads on Apple Silicon, and it is the primary mechanism for serving multiple concurrent users efficiently. Small SRSWTI models and community models like mlx-community/Qwen3.5-2B-6bit approach ~900 tok/s system throughput on an M4 Max when measured in-process. At the HTTP server layer, measured throughput currently reaches ~600 tok/s. The gap comes from the HTTP serialization layer rather than the inference engine, and we are actively working to close it. See the HTTP Bottleneck section below for details.

How It Works

The Continuous Batching Flow:

Request A arrives. The engine processes A's prompt and starts generating token 1.
Request B arrives. Instead of waiting, the engine's Scheduler injects B into the active batch instantly.
On the very next step, the GPU processes both A's token generation AND B's prompt processing simultaneously.
The output is streamed back dynamically: token 2 for A, and token 1 for B.
If Request A hits a stop word and finishes, it is ejected from the batch immediately, freeing up space for Request C, while Request B simply continues generating.

Why this is blazingly fast: Because Apple Silicon is bottlenecked by memory bandwidth during text generation, fetching the model weights accounts for roughly 80% of the time. If you can fetch the weights once and use them to multiply against four different requests simultaneously, you get nearly 4x the throughput with almost zero latency penalty.

This is called "continuous" because requests enter and exit the active GPU batch fluidly as they arrive and finish, without waiting for the whole batch to complete.
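
The flow above can be illustrated with a toy scheduler. This is a simplified simulation, not the engine's scheduler: each step produces one token per active sequence, prefill cost is ignored, and the function name simulate is hypothetical.

```python
from collections import deque


def simulate(requests: list, max_batch: int) -> dict:
    """requests: (name, arrival_step, tokens_to_generate). Returns {name: finish_step}."""
    pending = deque(sorted(requests, key=lambda r: r[1]))
    active = {}    # name -> tokens still to generate
    finished = {}  # name -> step on which the last token was produced
    step = 0
    while pending or active:
        # Admit requests that have arrived, as long as the batch has room.
        while pending and pending[0][1] <= step and len(active) < max_batch:
            name, _, n_tokens = pending.popleft()
            active[name] = n_tokens
        # One decode step yields one token for every active sequence.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                finished[name] = step
                del active[name]  # leaves immediately, freeing a slot
        step += 1
    return finished


result = simulate([("A", 0, 3), ("B", 1, 5), ("C", 2, 2)], max_batch=2)
print(result)
```

In this run, C arrives while the batch is full, joins on the step after A finishes, and completes before B, which never had to pause.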

Sequential vs. Continuous Batching

The difference is most visible in TTFT (time to first token) under concurrent load. In sequential mode, request 8 waits for requests 1–7 to finish — TTFT grows linearly with queue depth. In continuous batching, all requests are injected into the active batch and begin generating almost immediately.

Benchmarked on the blackbird-she-doesnt-refuse-21b model on an M1 Max with 64GB of memory:

Concurrency	Sequential Mean TTFT	CB Mean TTFT	Sequential Throughput	CB Throughput
4	6,510ms	541ms	44.4 tok/s	37.7 tok/s
8	12,837ms	247ms	44.1 tok/s	49.2 tok/s

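At concurrency 8, continuous batching delivers a 52x improvement in mean TTFT, from 12.8 seconds to 247ms. Sequential throughput stays flat (about 44 tok/s) because it is bottlenecked by single-request speed. CB throughput grows with concurrency as more sequences share each weight fetch: it trails sequential at concurrency 4 (37.7 vs 44.4 tok/s) and overtakes it at concurrency 8 (49.2 vs 44.1 tok/s).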
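
The ratios can be reproduced directly from the benchmark table. The snippet below only restates the table's numbers; it does not add new measurements.

```python
# concurrency -> (sequential TTFT ms, CB TTFT ms, sequential tok/s, CB tok/s)
rows = {
    4: (6510, 541, 44.4, 37.7),
    8: (12837, 247, 44.1, 49.2),
}

ttft_speedup = {c: seq_ttft / cb_ttft for c, (seq_ttft, cb_ttft, _, _) in rows.items()}
throughput_ratio = {c: cb_tps / seq_tps for c, (_, _, seq_tps, cb_tps) in rows.items()}

for c in rows:
    print(
        f"concurrency {c}: TTFT {ttft_speedup[c]:.1f}x lower, "
        f"throughput {throughput_ratio[c]:.2f}x of sequential"
    )
```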

Configuration Examples

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_type": "lm",
    "continuous_batching": true,
    "cb_max_num_seqs": 256,
    "cb_prefill_batch_size": 16,
    "cb_completion_batch_size": 32
  }'

Or via config.yaml (experimental):

models:
  - model_id: "raptor-batched"
    model_type: "lm"
    model_path: "srswti/bodega-raptor-8b-mxfp4"
    continuous_batching: true
    cb_max_num_seqs: 256
    cb_prefill_batch_size: 16
    cb_completion_batch_size: 32

The Configuration Flags Explained

To tune the batching engine, you have 5 main levers:

1. `--cb-max-num-seqs` (Default: 256)

What it is: The absolute maximum number of sequences (requests) the engine is allowed to hold in its scheduler at one time.

How to tune:

If this is too low, requests will be rejected under heavy load.
If it's too high, you might run out of KV-cache memory, causing macOS to swap unified memory to disk (very slow).
Set this based on your available RAM. 256 is safe for M1/M2/M3 Max chips (64GB) with 8B models.
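To size this against your RAM, the usual back-of-the-envelope estimate for KV-cache is one key tensor and one value tensor per layer for every cached token. The sketch below applies that formula; the model shape used in the example (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) is an illustrative assumption for an 8B-class model, not the published architecture of any Bodega model.

```python
def kv_cache_bytes(num_seqs, tokens_per_seq, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate KV-cache size in bytes for `num_seqs` sequences of equal length."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return num_seqs * tokens_per_seq * per_token


# Hypothetical shape: 256 sequences holding 512 cached tokens each.
example_gib = kv_cache_bytes(256, 512, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
```

Under those assumptions the cache alone needs 16 GiB, before model weights. Real usage depends on the actual model shape, cache precision, and how long each sequence grows.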

2. `--cb-completion-batch-size` (Default: 32)

What it is: The maximum number of sequences that can be actively generating tokens in the GPU simultaneously. How to tune:

Above ~32 concurrent generations, you start hitting computation limits on Apple Silicon GPUs, and individual Time-To-First-Token (TTFT) or Time-Per-Output-Token (TPOT) will rise.
32 means MLX will multiply the weights against a batch of 32 token vectors (one per active sequence) on every generation step.

3. `--cb-prefill-batch-size` (Default: 8)

What it is: When a burst of 50 new requests arrives, how many of them do we inject into the active batch on the very next step? How to tune:

Prefilling (processing the initial prompt) is computationally heavy. If you try to prefill 50 prompts at once, the GPU hangs for several seconds. If there are other requests currently generating tokens, those users will experience a massive stutter.
By capping this at 8, we ensure that new requests are digested in small bites. The active generation stream might pause for 100ms instead of 3000ms.
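The admission pattern for a burst can be sketched in a few lines. This is an illustration only: it assumes the active batch always has free slots (so only the prefill cap limits admission), and the function name is hypothetical.

```python
def admission_schedule(burst_size, prefill_batch_size):
    """Number of new requests admitted on each successive engine step."""
    schedule = []
    remaining = burst_size
    while remaining > 0:
        admitted = min(prefill_batch_size, remaining)
        schedule.append(admitted)
        remaining -= admitted
    return schedule
```

For a burst of 50 requests with a cap of 8, the schedule is [8, 8, 8, 8, 8, 8, 2]: seven short prefill rounds instead of one long stall.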

4. `--cb-chunked-prefill-tokens` (Default: 2048)

What it is: What if a single user submits a massive 16,000-token prompt? That alone will block the GPU. Chunked prefill solves this by splitting that 16K prompt into 2048-token chunks. How to tune:

During step 1, it processes chunk 1 (0-2048) alongside the active token generations.
Step 2: chunk 2 (2048-4096) + active generations.
This entirely eliminates the "long prompt stutter" problem for concurrent users. Set to 0 to disable.
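The chunk boundaries described above can be computed as follows. This is a minimal sketch of the splitting arithmetic only, not the engine's prefill code; the function name is hypothetical, and treating 0 as "one unsplit chunk" mirrors the "set to 0 to disable" behaviour.

```python
def prefill_chunks(prompt_tokens, chunk_tokens=2048):
    """Return (start, end) token ranges processed on successive engine steps."""
    if chunk_tokens <= 0:
        # Chunking disabled: the whole prompt is prefilled in one step.
        return [(0, prompt_tokens)]
    return [
        (start, min(start + chunk_tokens, prompt_tokens))
        for start in range(0, prompt_tokens, chunk_tokens)
    ]
```

A 16,000-token prompt with 2048-token chunks yields 8 ranges, starting with (0, 2048) and (2048, 4096) and ending with the shorter (14336, 16000).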

5. `--cb-enable-prefix-cache` (Default: True)

What it is: Automatic prompt caching. If User A asks a question about a 10,000 token document, the engine calculates the KV-cache and stores it in memory blocks. If User B asks a different question about the exact same document, the engine recognizes the shared prefix and instantly reuses the 10,000 token cache, dropping TTFT from seconds to milliseconds. How to tune: Leave it on. It uses block-aware memory management to automatically evict the oldest prefixes when you hit MLX memory pressure.
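One common way to build a block-aware prefix cache is to hash fixed-size token blocks, chaining each hash to the previous one so that a block only matches when everything before it matches too. The sketch below illustrates that idea with least-recently-used eviction. It is an assumption about the general technique, not Bodega's implementation: the class, the tiny block size, and the stored placeholder values are all hypothetical.

```python
import hashlib
from collections import OrderedDict

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for readability


class PrefixCache:
    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # chained block hash -> placeholder for KV data

    @staticmethod
    def _block_keys(tokens):
        """Hash each full block together with the hash of the preceding block."""
        keys, previous = [], b""
        full = len(tokens) - len(tokens) % BLOCK_TOKENS
        for start in range(0, full, BLOCK_TOKENS):
            block = repr(tokens[start:start + BLOCK_TOKENS]).encode("utf-8")
            previous = hashlib.sha256(previous + block).digest()
            keys.append(previous)
        return keys

    def match(self, tokens):
        """Return how many leading tokens already have cached blocks."""
        hit_tokens = 0
        for key in self._block_keys(tokens):
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)  # mark as recently used
            hit_tokens += BLOCK_TOKENS
        return hit_tokens

    def insert(self, tokens):
        """Record every full block of `tokens`, evicting the oldest when full."""
        for key in self._block_keys(tokens):
            self.blocks[key] = True
            self.blocks.move_to_end(key)
            while len(self.blocks) > self.max_blocks:
                self.blocks.popitem(last=False)
```

If User A's prompt is a 10-token document plus a question, and User B sends the same document with a different question, match() reports that the first 8 tokens (two full blocks) can be reused; only the remainder needs prefill.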

Tuning Parameters Summary

Parameter	Recommended	What It Controls
`cb_max_num_seqs`	256	Total scheduler capacity — active + waiting sequences combined. Lower this to 64 on 16GB Macs with large models to prevent KV-cache overflow and disk swapping.
`cb_completion_batch_size`	32	Max concurrent token generations per GPU step. The primary throughput lever. Above ~32 on small models, Apple Silicon hits compute saturation and per-token speed degrades. For 21B+ models, cap at 16.
`cb_prefill_batch_size`	8–16	How many new prompt-ingestion requests are allowed to enter the active batch per step. This is your TTFT fairness lever. Higher values process bursts faster but can cause brief generation stutter for active streams during the prefill phase.
`cb_chunked_prefill_tokens`	2048	Splits very long prompts into chunks ingested across multiple steps. Prevents a single massive-context request from freezing generation for everyone else.
`cb_enable_prefix_cache`	true	Block-aware KV-cache. Recognizes shared prefixes across requests (identical system prompts, shared documents) and reuses computed KV blocks, eliminating re-ingestion entirely.

Benchmark Results

blackbird-she-doesnt-refuse-21b (Hybrid SWA/Global attention)

Concurrency	Wall Time	Throughput	Mean TTFT	P95 TTFT
4	22.22s	37.7 tok/s	541ms	1463ms
8	13.96s	49.2 tok/s	247ms	372ms
16	16.71s	63.7 tok/s	1444ms	2880ms

Peak: 1.69x throughput gain. Gains plateau after concurrency 8 — this model is memory-bandwidth-bound at 21B. TTFT climbs at concurrency 16 as the prefill queue builds up.

deepseek-raptor-32b-4bit

Concurrency	Wall Time	Throughput	Mean TTFT	P95 TTFT
1	188.25s	8.8 tok/s	461ms	1629ms
4	161.37s	9.3 tok/s	773ms	1356ms
8	145.56s	10.4 tok/s	9,802ms	36,049ms
16	162.22s	9.8 tok/s	39,025ms	93,014ms

Peak: 1.18x gain, marginal. At 32B, this model is heavily compute-bound. Adding batch concurrency provides minimal throughput benefit while TTFT explodes. Recommended concurrency: 1–4.

bodega-raptor-0.9b

Concurrency	Wall Time	Throughput	Mean TTFT	P95 TTFT
4	46.26s	57.3 tok/s	234ms	976ms
8	44.41s	58.5 tok/s	241ms	318ms
16	32.30s	80.7 tok/s	440ms	567ms
32	20.57s	127.5 tok/s	923ms	1455ms

Peak: 2.23x throughput gain at concurrency 32. Strong scaling characteristic of sub-1B models where the GPU is purely bandwidth-bound.

mlx-community/Qwen3.5-2B-6bit

Concurrency	Wall Time	Throughput	Mean TTFT	P95 TTFT
4	12.96s	313.3 tok/s	204ms	1239ms
8	10.13s	384.2 tok/s	126ms	255ms
16	9.22s	447.2 tok/s	290ms	458ms
32	6.87s	619.8 tok/s	596ms	617ms

Peak: 1.98x gain at ~620 tok/s measured over HTTP. The ~900 tok/s figure was measured by calling the batching engine directly in-process, with no HTTP server, no SSE serialization, and no network stack. That number represents the raw inference ceiling on M4 Max. The ~280 tok/s gap you see in the HTTP benchmark comes from the server layer, not the inference engine. See The HTTP Bottleneck below.
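The "Peak" figures quoted for each model are consistent with dividing the highest measured throughput by the throughput at the lowest tested concurrency. The snippet below reproduces that arithmetic from the four tables above; the dictionary layout and function name are just for illustration.

```python
# Throughput in tok/s by concurrency, copied from the benchmark tables.
THROUGHPUT = {
    "blackbird-she-doesnt-refuse-21b": {4: 37.7, 8: 49.2, 16: 63.7},
    "deepseek-raptor-32b-4bit": {1: 8.8, 4: 9.3, 8: 10.4, 16: 9.8},
    "bodega-raptor-0.9b": {4: 57.3, 8: 58.5, 16: 80.7, 32: 127.5},
    "mlx-community/Qwen3.5-2B-6bit": {4: 313.3, 8: 384.2, 16: 447.2, 32: 619.8},
}


def peak_gain(by_concurrency):
    """Best throughput divided by throughput at the lowest tested concurrency."""
    baseline = by_concurrency[min(by_concurrency)]
    return round(max(by_concurrency.values()) / baseline, 2)


gains = {model: peak_gain(rows) for model, rows in THROUGHPUT.items()}
```

Note that the baselines differ between models (concurrency 1 for the 32B model, concurrency 4 for the others), so the gains are not directly comparable across tables.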

Detailed sweep — mlx-community/Qwen3.5-2B-6bit (mixed and same-query)

Scenario	Concurrency	Prefill Batch	Mean TTFT	P95 TTFT	Per-Req TPS	System Throughput
Mixed	8	2	196ms	316ms	58.7	450.5 tok/s
Mixed	8	4	206ms	289ms	59.9	451.6 tok/s
Mixed	8	8	258ms	259ms	62.4	462.6 tok/s
Mixed	16	4	344ms	557ms	36.0	534.0 tok/s
Mixed	16	8	321ms	425ms	36.6	536.0 tok/s
Mixed	16	16	331ms	332ms	37.3	547.4 tok/s
Mixed	32	8	384ms	676ms	31.2	889.5 tok/s
Mixed	32	16	424ms	616ms	31.4	901.3 tok/s
Same query	8	2	165ms	278ms	60.7	439.7 tok/s
Same query	8	4	144ms	198ms	64.0	475.8 tok/s
Same query	8	8	162ms	162ms	64.5	480.1 tok/s
Same query	16	4	316ms	543ms	35.8	517.9 tok/s
Same query	16	8	269ms	365ms	36.8	543.8 tok/s
Same query	16	16	301ms	302ms	37.6	556.2 tok/s
Same query	32	8	469ms	773ms	31.0	870.3 tok/s
Same query	32	16	467ms	638ms	31.8	902.2 tok/s

Key Takeaways

1. Throughput scales near-linearly with concurrency for small models. Without CB, system throughput equals per-request TPS (~60 tok/s). With CB at concurrency 32, you reach ~900 tok/s system throughput — a 15x total throughput gain on the same hardware.

2. Prefill batch size is a TTFT fairness lever, not a throughput lever. Notice P95 TTFT at concurrency 16: with prefill batch 4, P95 is 557ms — some users are waiting because they're stuck behind multiple prefill rounds. With prefill batch 16, P95 drops to 332ms and mean is 331ms — everyone in the burst gets their first token at nearly the same time. The rule: if you expect burst traffic (many requests arriving simultaneously), set a higher prefill batch. If requests arrive organically over time, a lower prefill batch keeps active generation streams smoother.

3. Prefix caching is a meaningful TTFT accelerator. At concurrency 8, mixed queries average 196–258ms TTFT. With the same query (shared prefix, cache hit for all subsequent requests), mean TTFT falls to 144–165ms; the best run (prefill batch 4) reaches 144ms mean with a P95 of 198ms. The engine computed the prompt KV-cache once and reused it across all 8 requests. Per-request TPS also climbs, from 58.7–62.4 to 60.7–64.5, because subsequent requests skip prompt ingestion entirely.

4. Large models (21B+) have a concurrency sweet spot. For the 32B model, optimal concurrency is 1–4. Pushing to 8+ concurrent requests causes TTFT to spike into the tens of seconds — the GPU is compute-saturated and the KV-cache grows large enough to risk swapping. For 21B models, concurrency 8 is the practical ceiling before TTFT becomes unacceptable for real-time users.

The HTTP Bottleneck

For small and mid-size models, the batching engine is fast enough that the HTTP server itself becomes the bottleneck — a situation that is uncommon in most inference systems and speaks to how aggressively the Bodega inference engine saturates Apple Silicon's memory bandwidth.

What's happening: When the batching engine generates tokens, it produces them in steps. Each step generates one token per active sequence simultaneously, then the output needs to be: serialized to JSON, wrapped in a Server-Sent Events data: frame, written to each open HTTP response stream, and flushed through the OS network stack. For large models generating at 8–30 tok/s, this overhead is negligible. For a model running at 900 tok/s in-process across 32 concurrent streams, each engine step completes in milliseconds — and the HTTP layer starts struggling to keep up with the token emission rate.

The measured gap on M4 Max with Qwen3.5-2B-6bit:

Mode	Throughput
In-process (direct engine call, no HTTP)	~900 tok/s
HTTP with streaming (`text/event-stream`)	~600 tok/s
Gap	~300 tok/s (~33% overhead)

A note on the measured 600 tok/s figure: This was recorded on a live macOS system, not an isolated benchmark environment. Apple Silicon's unified memory architecture makes this more significant than it would be on a discrete GPU system. On a dedicated GPU, inference has its own VRAM and the CPU/system RAM is separate. On Apple Silicon, everything — the inference engine, WindowServer, your browser's GPU process, Electron renderers — shares the same memory bus and the same Metal command queue. So a busy Electron app isn't just using CPU, it's genuinely competing for the same memory bandwidth that the inference engine depends on. The true HTTP ceiling on a fully idle machine may be measurably higher than 600 tok/s. The in-process ~900 tok/s figure is a tighter measurement by comparison since it bypasses the HTTP layer entirely, but both numbers should be treated as real-world approximations rather than hardware ceilings.

The ~300 tok/s gap is mostly not lost inference work. The overhead is a chain of four costs that compound: JSON serialization of each token delta, wrapping it in an SSE data: frame, asyncio coroutine scheduling (the GIL becomes a factor with 32 simultaneous response streams), and the TCP flush through the OS network stack. Attributing all of it to the asyncio event loop alone would understate the picture. At lower concurrency (1–8 users), the gap is much smaller because far fewer frames have to be flushed per second.

Python's asyncio layer becoming the bottleneck before the inference engine does at high concurrency is a real phenomenon. Per-token SSE flushing is genuinely expensive, and vLLM, the llama.cpp server, and TGI have all documented it.

The GPU is also not entirely independent of the server layer. At very high concurrency, asyncio backpressure can slow engine step dispatch slightly, because the event loop is busy flushing and isn't ready for the next step.

Theoretical optimised ceiling

If we implement batched token emission (buffering 5–10ms of tokens before flushing rather than one flush per engine step), the estimated recovery is roughly 200–250 tok/s, bringing HTTP throughput to roughly 820 tok/s at concurrency 32. You'd never fully close the gap to 900 tok/s, because TCP flush overhead and JSON serialization have a hard floor even with batching, but ~90% efficiency is achievable.

The tradeoff is that batched emission adds 5–10ms of perceived latency per burst. At high concurrency that's completely invisible. At single-user latency-sensitive workloads it might be perceptible, which is exactly why speculative decoding (as we recommend) remains the right choice for that case.
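Batched emission can be sketched as an async generator that collects tokens for a short window and emits them as a single SSE frame. This is a minimal illustration of the idea, not the engine's server code: the function name, the use of an asyncio.Queue with None as an end-of-stream marker, and the "delta" field are all assumptions made for the example.

```python
import asyncio
import json


async def batched_sse_frames(token_queue, window_s=0.005):
    """Yield SSE frames, coalescing tokens that arrive within `window_s` seconds.

    `token_queue` is an asyncio.Queue of token strings; None marks end of stream.
    """
    loop = asyncio.get_running_loop()
    finished = False
    while not finished:
        first = await token_queue.get()
        if first is None:
            break
        buffered = [first]
        deadline = loop.time() + window_s
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                token = await asyncio.wait_for(token_queue.get(), remaining)
            except asyncio.TimeoutError:
                break
            if token is None:
                finished = True
                break
            buffered.append(token)
        # One JSON encode and one frame for the whole window.
        payload = json.dumps({"delta": "".join(buffered)})
        yield f"data: {payload}\n\n"


async def demo():
    """Feed three tokens that are already queued and collect the frames."""
    queue = asyncio.Queue()
    for item in ["Hel", "lo", "!", None]:
        queue.put_nowait(item)
    return [frame async for frame in batched_sse_frames(queue, window_s=1.0)]
```

In the demo all three tokens are waiting when the window opens, so they leave as one frame instead of three. The window length is the latency cost discussed above: a token can wait up to window_s before it is flushed.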

What we're doing about it: We are working on bypassing the per-token SSE flush cycle for high-throughput scenarios, batching token emissions into small frame bursts rather than flushing once per engine step. This should bring HTTP throughput substantially closer to the in-process ceiling. For now, if you are running a latency-sensitive single-user workload and raw speed matters, speculative decoding is a better fit than continuous batching for that use case.

Hardware Guidelines

Small models (90M–8B) on any Mac with 16GB+ RAM:

cb_max_num_seqs: 256
cb_completion_batch_size: 32
cb_prefill_batch_size: 16

Large models (14B–32B) on 32GB+ RAM:

cb_max_num_seqs: 64
cb_completion_batch_size: 16
cb_prefill_batch_size: 4–8

Custom Chat Templates

Override a model's default chat template:

curl -X POST http://localhost:44468/v1/admin/load-model \
  -H "Content-Type: application/json" \
  -d '{
    "model_path": "SRSWTI/bodega-raptor-8b-mxfp4",
    "model_type": "lm",
    "chat_template_file": "/path/to/custom_template.jinja"
  }'

Monitoring and Health

Health Check

Endpoint: GET /health

curl http://localhost:44468/health

Healthy (multi-model):

{
  "status": "ok",
  "model_id": "bodega-solomon-9b, bodega-raptor-8b",
  "model_status": "initialized (2 model(s))",
  "models_detail": [
    {"id": "bodega-solomon-9b", "type": "multimodal", "status": "running", "ram_usage_mb": 11645.8},
    {"id": "bodega-raptor-8b", "type": "lm", "status": "running", "ram_usage_mb": 4558.4}
  ]
}

No models loaded:

{
  "status": "unhealthy",
  "model_id": null,
  "model_status": "no_models"
}
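A monitoring script can reduce the health payload to a few fields. The sketch below works on the parsed JSON shown above; the function name and the shape of the summary are hypothetical, and it relies only on the documented fields status, models_detail, id, status, and ram_usage_mb.

```python
def summarize_health(payload):
    """Condense a parsed /health response into a small summary dict."""
    models = payload.get("models_detail") or []
    return {
        "healthy": payload.get("status") == "ok",
        "running": [m["id"] for m in models if m.get("status") == "running"],
        "total_ram_mb": round(sum(m.get("ram_usage_mb", 0.0) for m in models), 1),
    }
```

For the multi-model example above this reports two running models using 16204.2 MB in total; for the "no models loaded" response it reports an unhealthy engine with no running models.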

Queue Statistics

Endpoint: GET /v1/queue/stats

curl http://localhost:44468/v1/queue/stats

import requests

# Fetch scheduler statistics and print the two headline counters.
response = requests.get("http://localhost:44468/v1/queue/stats")
stats = response.json()["queue_stats"]
print(f"Queue size: {stats.get('queue_size', 0)}")
print(f"Active requests: {stats.get('active_requests', 0)}")

Best Practices

Model Selection

Our Open-Source Work

Explore our Models: Hugging Face
Coding CLI: axe on GitHub

Fastest (edge/laptop):

srswti/bodega-orion-0.6b — Sub-100M params, exceptional tool calling and reasoning at the edge (https://huggingface.co/srswti)
SRSWTI/bodega-raptor-0.9b — 400+ tok/s, ideal for classification and query reformulation
SRSWTI/axe-turbo-1b — Sub-50ms first token, edge-first agentic coding

Balanced performance:

SRSWTI/bodega-raptor-1b-reasoning-opus4.5-distill — Distilled from Claude Opus 4.5 reasoning patterns
SRSWTI/bodega-vertex-4b — Optimized for structured data processing
SRSWTI/bodega-raptor-8b-mxfp4 — Best general-purpose choice for laptops

Multimodal and agentic:

SRSWTI/bodega-solomon-9b — Vision + best-in-class agentic coding workflows

High capacity:

SRSWTI/bodega-raptor-15b-6bit — Enhanced Raptor variant
SRSWTI/bodega-centenario-21b-mxfp4 — Production workhorse, 21B params optimized for sustained workloads
SRSWTI/blackbird-she-doesnt-refuse-21b — Uncensored 21B for unrestricted generation
SRSWTI/axe-turbo-31b — High-capacity desktop/server variant with agentic coding focus

Flagship intelligence:

SRSWTI/deepseek-v3.2-speciale-distilled-raptor-32b-4bit — DeepSeek V3.2 distilled to 32B with Raptor reasoning. Exceptional math and code generation in a 5–7GB footprint. 120 tok/s on M4 Max.

Memory Management

Use the smallest context length that fits your use case
Unload models you're not actively using to free unified memory
Monitor queue stats to avoid overloading the scheduler
Prefer quantized (4-bit or 8-bit) models for better memory efficiency

Performance Optimization

Set max_concurrency: 1 for single-user scenarios
Use streaming for long responses to improve perceived latency
Enable prompt_cache_size for workloads with recurring prefixes
Use speculative decoding for single-user, latency-sensitive workloads
Use continuous batching for multi-user, throughput-sensitive workloads

Error Handling

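import requests

# Example payload; substitute your own model ID and messages.
payload = {
    "model": "bodega-raptor-8b",
    "messages": [{"role": "user", "content": "Hello"}],
}
response = requests.post("http://localhost:44468/v1/chat/completions", json=payload)

if response.status_code == 503:
    print("No model loaded. Load a model first.")
elif response.status_code == 400:
    print("Invalid request parameters.")
elif response.status_code == 200:
    result = response.json()
else:
    print(f"Error: {response.status_code}")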

Document Indexing (RAG)

Bodega includes a fully self-contained RAG pipeline for PDF documents.

Upload & Index a PDF

Endpoint: POST /v1/rag/upload

curl -X POST http://localhost:44468/v1/rag/upload \
  -F "file=@/path/to/your/document.pdf"

Response:

{
  "file_id": "rag-c6cd8f10",
  "filename": "document.pdf",
  "num_chunks": 71,
  "status": "indexed"
}

Query an Indexed PDF

The engine embeds your question, retrieves the most relevant chunks via FAISS cosine-similarity, and passes the context alongside your query to the active chat model.

Endpoint: POST /v1/rag/query

curl -X POST http://localhost:44468/v1/rag/query \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "rag-c6cd8f10",
    "query": "What is the main conclusion of this document?",
    "model": "bodega-raptor-8b",
    "top_k": 5
  }'

Add "stream": true to receive the answer as a Server-Sent Events stream, identical to the standard /v1/chat/completions endpoint.
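The retrieval step can be pictured as a cosine-similarity ranking over chunk embeddings. The engine uses FAISS for this; the sketch below swaps in plain Python so the logic is visible, and it uses toy two-dimensional vectors in place of real embeddings. Function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        enumerate(chunk_vecs),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]
```

The chunks at the returned indices are what gets passed to the chat model as context; top_k in the request controls how many are kept.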

List Indexed Documents

Endpoint: GET /v1/rag/documents

curl http://localhost:44468/v1/rag/documents

Delete an Indexed Document

Endpoint: DELETE /v1/rag/documents/{file_id}

curl -X DELETE http://localhost:44468/v1/rag/documents/rag-c6cd8f10
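The query, list, and delete calls above can be wrapped in a small Python helper. This is a sketch using only the standard library; the helper names are hypothetical, and it assumes non-streaming responses come back as JSON. Uploading is left to the curl command above, since multipart uploads need more code than fits here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:44468"


def build_rag_query(file_id, query, model, top_k=5, stream=False):
    """Build the JSON body for POST /v1/rag/query."""
    body = {"file_id": file_id, "query": query, "model": model, "top_k": top_k}
    if stream:
        body["stream"] = True
    return body


def rag_request(method, path, body=None, timeout_s=120):
    """Send a request to the engine and return the parsed response (assumed JSON)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    request = urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the engine running, rag_request("POST", "/v1/rag/query", build_rag_query("rag-c6cd8f10", "What is the main conclusion of this document?", "bodega-raptor-8b")) asks a question, rag_request("GET", "/v1/rag/documents") lists indexed files, and rag_request("DELETE", "/v1/rag/documents/rag-c6cd8f10") removes one.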

Security

The server runs on localhost:44468 only and is not accessible from external networks
No authentication is required for local access
Do not expose this port to the internet without adding proper security measures
Only set trust_remote_code: true for models from verified sources

Documentation last updated: March 2026

