
Official Implementation of JHCodec


Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

JHCodec is a pure Transformer-decoder-based neural audio codec with residual vector quantization (RVQ). It achieves state-of-the-art performance with minimal latency.

Overview

This repository contains the implementation for training and running inference with neural audio codecs, with end-to-end training capabilities. The codec supports:

  • Multiple RVQ architectures (DAC, MIMI)
  • End-to-end training leveraging (distilled) w2v-bert-2.0 semantic features
  • SSRR and non-SSRR variants

TODO

  • Revise Readme
  • Upload checkpoint
  • Upload to HuggingFace
  • Upload to PyPI (probably after the review)
  • Make non-anonymous (after the review)

Installation

pip install -e .

Requirements

  • Python >= 3.10 (required for using the X | None union type syntax in type hints; see PEP 604), or manually remove this syntax if using an older Python version
  • PyTorch/TorchAudio with CUDA support: tested with torch==2.6.0+cu124 and torch==2.9.1+cu128
  • omegaconf==2.3.0: for configuration management
  • Flash-Attention: for fast training/inference. We tested with flash-attn==2.7.4.post1 and flash-attn==2.8.3.
  • HF transformers: required only for running baselines and w2v-bert-2.0; JHCodec inference has no dependency on it.

We provide a shell script to help set up the environment. PLEASE DO NOT RUN IT DIRECTLY: review the script and modify it as needed for your system.

Our model requires only the minimum dependencies listed above.

Note: Running on CPU currently leads to degraded reconstruction quality. If you discover the cause or a solution, please open an issue.

For training

To install the additional required libraries, run:

pip install omegaconf==2.3.0
pip install alias-free-torch==0.0.6 phaseaug

Flash-Attention must be installed carefully; please read its official README.

Official Checkpoints

Model        Description                                                       Link
JHCodec      Streaming RVQ Codec, JHCodec-M (1M)                               jhcodec/jhcodec
SW2V (60k)   Streaming Speech Representation Extractor                        jhcodec/sw2v_60k
SW2V (120k)  Streaming Speech Representation Extractor, more robust to noise  jhcodec/sw2v_120k

Project Structure

codec_paper/
├── jhcodec/                      # Main package
│   ├── model/                    # Model implementations
│   │   ├── codec.py              # Main codec models (JHCodec, JHCodecMimi)
│   │   ├── sw2v.py               # streaming wav2vec encoder
│   │   ├── discriminator.py      # Discriminator for adversarial training
│   │   └── vq.py                 # Vector quantization modules
│   ├── kernel/                   # Triton custom kernels 
│   │   ├── rotary_kernel.py      # Rotary positional embedding kernel, adapted from FlashAttn
│   │   └── vq_kernel.py          # Vector quantization kernel
│   ├── loss/                     # Custom loss functions
│   │   └── multiscalemelspec.py  # Implements MultiScaleMelSpectrogramLoss used for perceptual audio training
│   ├── train_codec_e2e_w2v.py    # End-to-end training script
│   ├── decode_eval.py            # Decoding and evaluation script
│   └── dataloader.py             # Data loading utilities
├── config/                       # Configuration files
│   ├── config_dac_norecon.json   # DAC without reconstruction
│   ├── config_dac_recon.json     # DAC with reconstruction
│   ├── config_mimi_norecon.json  # MIMI without reconstruction
│   └── config_mimi_recon.json    # MIMI with reconstruction
└── setup.py

Training

Data Preparation

The __main__ block of jhcodec/dataloader.py demonstrates how to construct and inspect an AudioDataset:

from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                  # Path to your data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                  # Use True to scan files initially (slow), or False to load from cache
    cache_dir='cache_dir/dataloader/v9', # location of the cache
    use_mel=False,                       # Set True to return also Mel features
)

Notes:

  • Initial dataset caching may take a while; once done, restart with init_dataset=False for faster loading.
  • Requires all dependencies (see the imports at the top of jhcodec/dataloader.py).
  • You can add a custom dataset by modifying the dictionary at the top of dataloader.py.
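For orientation, a collate_fn for variable-length audio typically zero-pads each clip to the longest in the batch and keeps the original lengths for masking. The sketch below is a pure-Python illustration of that idea, not the repo's collate_fn, which may return tensors and extra fields (e.g. Mel features when use_mel=True):

```python
def pad_collate(batch):
    """Zero-pad 1-D audio clips (sequences of samples) to the batch max length.

    Returns (padded clips, original lengths); downstream code can use the
    lengths to mask out the padding. Illustrative only; jhcodec's own
    collate_fn may differ in details.
    """
    lengths = [len(clip) for clip in batch]
    max_len = max(lengths)
    padded = [list(clip) + [0.0] * (max_len - len(clip)) for clip in batch]
    return padded, lengths
```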

For DAC with reconstruction:

python jhcodec/train_codec_e2e_w2v.py \
    --experiment_name paper/dac_recon \
    --config config/config_dac_recon.json \
    --resume  # optional: resume from the latest checkpoint

For MIMI with reconstruction:

python jhcodec/train_codec_e2e_w2v.py \
    --experiment_name paper/mimi_recon \
    --config config/config_mimi_recon.json \
    --resume

Available Configurations:

  • config_dac_norecon.json - DAC without reconstruction
  • config_dac_recon.json - DAC with reconstruction
  • config_mimi_norecon.json - MIMI without reconstruction
  • config_mimi_recon.json - MIMI with reconstruction (main)

Training Parameters

Key training parameters (configurable in JSON config files):

  • learning_rate: 1e-4
  • batch_size: 42
  • num_epochs: 100
  • warmup_steps: 1000
  • discriminator_start_steps: 10000
  • Loss weights for reconstruction, VQ, commit, feature matching, and adversarial losses
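The warmup_steps value implies a learning-rate warmup; the exact schedule is defined in the training script, but a common linear-warmup rule, using the default values above for illustration, looks like this:

```python
def warmup_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linearly ramp the learning rate from ~0 to base_lr over warmup_steps,
    then hold it. Any post-warmup decay is up to jhcodec's training script;
    this sketch only shows the warmup phase."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```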

Decoding

Single Files

python jhcodec/inference.py \
    --config config/config_mimi_recon.json \
    --checkpoint jhcodec_mimi_1000000.pt \
    --input_file /path/to/input.wav \
    --output_file /path/to/output.wav \
    --num_codebooks 8 \
    --device 'cuda'

Multiple Files

python jhcodec/decode_eval.py \
    --config config/config_dac_norecon.json \
    --checkpoint /path/to/checkpoint_300000.pt \
    --name jhcodec_dac_norecon \
    --glob_pattern "/path/to/audio/*.wav" \
    --out_dir "out_dir" \
    --hierarchy 4

Arguments:

  • --config: Path to configuration file
  • --checkpoint: Path to model checkpoint
  • --name: Model name for output directory
  • --glob_pattern: Glob pattern for input audio files
  • --hierarchy: Depth of quantization hierarchy (default: 4)
  • --out_dir: Output directory

Supported Datasets

The decoding script supports various audio datasets:

  • LibriSpeech: /data/LibriSpeech/test-other/*/*/*.flac
  • TITW: /data/titw/titw_hard/test/*.wav
  • MLS: /data/MLS/mls_*/test/audio/*/*/*.flac

Configuration

Configuration files are JSON-based and include:

  • Model Architecture: Encoder/decoder layers, attention heads, embedding dimensions
  • Vector Quantization: Codebook size, number of codebooks, embedding dimensions
  • Training: Learning rate, batch size, loss weights, discriminator settings
  • Data: Sample rate, segment duration, data directories
  • Logging: Checkpoint intervals, tensorboard settings

Example configuration structure:

{
    "model": {
        "encoder": {...},
        "decoder": {...},
        "rvq": {
            "type": "dac",
            "num_codebooks": 8,
            "codebook_size": 1024
        }
    },
    "training": {...},
    "loss": {...},
    "data": {...}
}
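Because the configs are plain JSON (the repo manages them with omegaconf), you can also inspect one with the stdlib alone. A minimal sketch reading the RVQ fields from the structure shown above (the function name is ours):

```python
import json

def load_rvq_settings(config_text: str) -> tuple:
    """Parse a jhcodec-style JSON config and return the
    (type, num_codebooks, codebook_size) triple from model.rvq."""
    cfg = json.loads(config_text)
    rvq = cfg["model"]["rvq"]
    return rvq["type"], rvq["num_codebooks"], rvq["codebook_size"]
```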

Main Contact

Anonymous. Contact: jhcodec843@gmail.com. Submitted to Interspeech 2026.

License

MIT License
