[Feature]: Support for Triton attention backend for inference #1544

@stepfunction83

🚀 The feature, motivation and pitch

Currently, PagedAttention supports only specific head_size values, which prevents models such as Magistral 2509 (head_size of 160) from running. vLLM handles this by falling back to a Triton attention backend instead of PagedAttention in these cases.
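
As a rough illustration of the kind of fallback vLLM applies, a backend selector could check the model's head_size against the sizes the PagedAttention kernels are compiled for and route unsupported sizes to Triton. The sketch below is hypothetical: the helper name and the supported-size set are assumptions, not this project's or vLLM's actual API.

```python
# A minimal sketch of the proposed fallback, not an actual API: the
# helper name and the supported-size set are illustrative and would
# need to match the real compiled PagedAttention kernels.

# Head sizes the PagedAttention kernels are built for (illustrative set).
PAGED_ATTENTION_HEAD_SIZES = {64, 80, 96, 112, 128, 256}


def select_attention_backend(head_size: int) -> str:
    """Return "paged" when PagedAttention supports the head size,
    otherwise fall back to the Triton attention backend."""
    if head_size in PAGED_ATTENTION_HEAD_SIZES:
        return "paged"
    # Magistral 2509 has head_size == 160, which PagedAttention does
    # not support, so it would be routed to Triton here.
    return "triton"


if __name__ == "__main__":
    print(select_attention_backend(128))  # -> paged
    print(select_attention_backend(160))  # -> triton
```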

I recommend offering Triton as a fallback backend in situations where PagedAttention cannot run a given model.

Alternatives

Don't support models whose head_size values are unsupported by PagedAttention.

Additional context

No response
