

Perception Models: Powerful Models for Image and Video Perception

Meta AI Research, FAIR

Perception Models is home to state-of-the-art models for image and video perception: Perception Encoder (PE) for encoding and Perception Language Model (PLM) for decoding. We designed this repository to be user-friendly and to support training, inference, and evaluation of both models, with an emphasis on keeping the code modular and easy to extend and experiment with.

  • [Apr-18-25]: Perception Language Model (PLM) and PLM-VideoBench are added to lmms-eval. This makes it easy to reproduce PLM results and to evaluate on PLM-VideoBench. [lmms-eval] 🔥🔥
  • [Apr-17-25]: Perception Encoder (PE) and Perception Language Model (PLM) are released. [Blog] 🔥🔥

Perception Encoder (PE)


Perception Encoder (PE) is a powerful and versatile family of vision encoders for both images and video: PE core can outperform SigLIP2 on Image CLIP and InternVideo2 on Video CLIP; PE lang can be used to outperform QwenVL2.5 and InternVL3 on vision language modeling; and PE spatial can outperform DINOv2 on dense prediction tasks. All of these variants come from the same, easily scalable contrastive pretraining.

See apps/pe/README.md for more information and for how to get started with these models!

Perception Language Model (PLM)


PerceptionLM (PLM) is a family of open and fully reproducible models to facilitate research in vision-language modeling (VLM). In conjunction with PE, it is powerful enough to compete with the latest state-of-the-art VLMs such as InternVL3 and QwenVL2.5, while using fully open data. We also release the largest spatiotemporally annotated video dense captioning and fine-grained human activity recognition datasets to date.

See apps/plm/README.md for details and how to get started!

Installation 🔧

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models

conda create --name perception_models python=3.12
conda activate perception_models

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .

This installs an editable version of the repo, so you can change the code without reinstalling the package each time.
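After installing, it can help to confirm that the pinned packages actually landed in the active environment. The following is a minimal sketch, not part of the repository: the helper names (`installed_version`, `matches_pin`, `check_environment`) are hypothetical, and the version pins are simply copied from the install commands above. It reads package metadata only, so it does not import PyTorch or check that CUDA works.

```python
from importlib import metadata

# Version pins copied from the install commands above.
# xformers is installed without a pin there, so None means "any version".
EXPECTED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "torchcodec": "0.1",
    "xformers": None,
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pin):
    """Check an installed version against a pin, ignoring local tags like '+cu124'."""
    if installed is None:
        return False
    if pin is None:
        return True
    have = installed.split("+", 1)[0].split(".")
    want = pin.split(".")
    # Pad with zeros so that "0.1" matches "0.1.0" but not "0.10".
    width = max(len(have), len(want))
    have += ["0"] * (width - len(have))
    want += ["0"] * (width - len(want))
    return have == want


def check_environment(expected=EXPECTED):
    """Map each package name to a pair (installed_version, ok)."""
    report = {}
    for name, pin in expected.items():
        version = installed_version(name)
        report[name] = (version, matches_pin(version, pin))
    return report


if __name__ == "__main__":
    for name, (version, ok) in check_environment().items():
        status = "ok" if ok else "CHECK"
        print(f"{name:12s} {version or 'not installed':20s} {status}")
```

Run it inside the activated `perception_models` conda environment. Any row marked `CHECK` means the package is missing or its version differs from the pin used in the install commands.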

🙏 Acknowledgement

We are thankful to Meta Lingua for releasing their code as open source. The code structure and the LLM implementation are forked directly from Meta Lingua. We are also thankful to Open_CLIP for its open-source contributions to CLIP training, and to CLIP_benchmark for CLIP model evaluation.

📜 Citation

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv:2504.13181},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv:2504.13180},
  year={2025}
}
