
Perception Models: An Easy to Use Repository for Perception Tasks

Meta AI Research, FAIR


Perception Models is a user-friendly repository for training, inference, and evaluation of the Perception Language Model (PLM) and the Perception Encoder (PE). It is built to be modular and easy to extend and experiment with.


  • [Apr-17-25]: Perception Encoder (PE) and Perception Language Model (PLM) are released. [Blog] 🔥🔥

Perception Encoder (PE)


We release PE, a family of state-of-the-art vision encoders for vision-centric and vision-language tasks. See apps/pe/README.md for details on inference, evaluation, and downstream tasks.


Perception Language Model (PLM)


We release PLM, a family of open and fully reproducible models to facilitate research on vision-language models (VLMs). See apps/plm/README.md for details on training, evaluation, and inference with PLM.


Installation 🔧

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models

conda create --name perception_models python=3.12
conda activate perception_models

# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124

# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124

pip install -e .

This installs the repo in editable mode, so changes you make to the code take effect without reinstalling the package.
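After installation, a quick sanity check can confirm that the core dependencies resolved. This is a minimal sketch (not part of the repo); the package names are taken from the install commands above:

```python
import importlib.util

def check_install(packages):
    """Return the subset of packages whose import cannot be resolved."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Core dependencies installed by the steps above.
missing = check_install(["torch", "torchvision", "torchaudio", "torchcodec", "xformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core dependencies found.")
```

If anything is reported missing, re-run the corresponding `pip install` step above inside the `perception_models` conda environment.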


🙏 Acknowledgement

We thank the Meta Lingua team for open-sourcing their code; the code structure and implementation of our LLM are forked directly from Meta Lingua. We also thank Open_CLIP for their open-source CLIP training code and CLIP_benchmark for CLIP model evaluation.

📜 Citation

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv:2504.13181},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv:2504.13180},
  year={2025}
}
