# Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

JHCodec is a pure Transformer-decoder-based neural audio codec with residual vector quantization. It achieves state-of-the-art performance with minimal latency.

This repository contains the implementation for training and running inference with neural audio codecs. The codec supports:
- Multiple RVQ architectures (DAC, MIMI)
- End-to-end training leveraging (distilled) w2v-bert-2.0 semantic features
- SSRR and non-SSRR variants

## TODO
- Revise README
- Upload checkpoint
- Upload to HuggingFace
- Upload to PyPI (probably after the review)
- Make non-anonymous (after the review)
## Installation

```bash
pip install -e .
```

Requirements:
- Python >= 3.10: required for the `X | None` union type syntax in type hints (see PEP 604); remove this syntax manually if you must use an older Python version
- PyTorch/TorchAudio with CUDA support: tested with `torch==2.6.0+cu124` and `torch==2.9.1+cu128`
- `omegaconf==2.3.0`: for configuration management
- Flash-Attention: for fast training/inference; tested with `flash-attn==2.7.4.post1` and `flash-attn==2.8.3`
- HF transformers: required only for running baselines and w2v-bert-2.0; JHCodec inference has no dependency on it
We provide a shell script to help set up the environment. **Please do not run it directly.** Review the script and modify it as needed for your system; our model requires only the minimum dependencies listed above.
Note: Running on CPU currently leads to degraded reconstruction quality. If you discover the cause or a solution, please open an issue.
To install the remaining required libraries, run:

```bash
pip install omegaconf==2.3.0
pip install alias-free-torch==0.0.6 phaseaug
```
Flash-Attention should be installed carefully. Please read the official README.
## Pretrained Models

| Model | Description | Link |
|---|---|---|
| JHCodec | Streaming RVQ Codec, JHCodec-M (1M) | jhcodec/jhcodec |
| SW2V (60k) | Streaming Speech Representation Extractor | jhcodec/sw2v_60k |
| SW2V (120k) | Streaming Speech Representation Extractor, more robust to noise | jhcodec/sw2v_120k |
## Repository Structure

```
codec_paper/
├── jhcodec/                        # Main package
│   ├── model/                      # Model implementations
│   │   ├── codec.py                # Main codec models (JHCodec, JHCodecMimi)
│   │   ├── sw2v.py                 # Streaming wav2vec encoder
│   │   ├── discriminator.py        # Discriminator for adversarial training
│   │   └── vq.py                   # Vector quantization modules
│   ├── kernel/                     # Custom Triton kernels
│   │   ├── rotary_kernel.py        # Rotary positional embedding kernel, adapted from FlashAttention
│   │   └── vq_kernel.py            # Vector quantization kernel
│   ├── loss/                       # Custom loss functions
│   │   └── multiscalemelspec.py    # MultiScaleMelSpectrogramLoss for perceptual audio training
│   ├── train_codec_e2e_w2v.py      # End-to-end training script
│   ├── decode_eval.py              # Decoding and evaluation script
│   └── dataloader.py               # Data loading utilities
├── config/                         # Configuration files
│   ├── config_dac_norecon.json     # DAC without reconstruction
│   ├── config_dac_recon.json       # DAC with reconstruction
│   ├── config_mimi_norecon.json    # MIMI without reconstruction
│   └── config_mimi_recon.json      # MIMI with reconstruction
└── setup.py
```
## Training

### Data Preparation

The main block of `jhcodec/dataloader.py` demonstrates how to construct and inspect an `AudioDataset`:
```python
from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                   # Path to your data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                   # True: scan files initially (slow); False: load from cache
    cache_dir='cache_dir/dataloader/v9',  # Location of the cache
    use_mel=False,                        # Set True to also return Mel features
)
```

Notes:
- Initial dataset caching may take a while; once done, restart with `init_dataset=False` for faster loading.
- Requires all dependencies (see the top of `jhcodec/dataloader.py`).
- You can add a custom dataset by modifying the dictionary at the top of `dataloader.py`.
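For intuition, fixed-length training segments of `segment_duration` seconds at `sample_rate` Hz correspond to `int(16000 * 10.24) = 163840` samples per example. Here is a minimal, illustrative sketch of such cropping in plain Python; the real logic lives in `jhcodec/dataloader.py` and operates on tensors, so treat this only as a conceptual stand-in:

```python
import random

def crop_segment(audio: list, sample_rate: int = 16000,
                 segment_duration: float = 10.24, training: bool = True) -> list:
    """Illustrative fixed-length segmenting, not the actual dataset code."""
    seg_len = int(sample_rate * segment_duration)  # 163840 samples
    if not training or len(audio) <= seg_len:
        # Evaluation mode (or short clips): zero-pad up to the segment length.
        return audio + [0.0] * max(0, seg_len - len(audio))
    start = random.randrange(len(audio) - seg_len + 1)  # random crop offset
    return audio[start:start + seg_len]
```

A random crop per epoch gives the model varied views of long recordings while keeping batch shapes uniform.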
### Training Commands

```bash
python jhcodec/train_codec_e2e_w2v.py \
    --experiment_name paper/dac_recon \
    --config config/config_dac_recon.json \
    --resume  # only when resuming
```

```bash
python jhcodec/train_codec_e2e_w2v.py \
    --experiment_name paper/mimi_recon \
    --config config/config_mimi_recon.json \
    --resume
```

Available configurations:
- `config_dac_norecon.json` - DAC without reconstruction
- `config_dac_recon.json` - DAC with reconstruction
- `config_mimi_norecon.json` - MIMI without reconstruction
- `config_mimi_recon.json` - MIMI with reconstruction (main)
Key training parameters (configurable in the JSON config files):
- `learning_rate`: 1e-4
- `batch_size`: 42
- `num_epochs`: 100
- `warmup_steps`: 1000
- `discriminator_start_steps`: 10000
- Loss weights for reconstruction, VQ, commitment, feature matching, and adversarial losses
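The `warmup_steps` and `learning_rate` values above imply a warmup schedule. As a hedged sketch only (the actual schedule is defined in `train_codec_e2e_w2v.py` and may differ, e.g. by adding decay), a standard linear warmup looks like:

```python
def warmup_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold.
    Illustrative only -- check the training script for the real schedule."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# e.g. warmup_lr(500) is half of base_lr; after step 1000 it holds at 1e-4.
```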
## Inference

### Single File

```bash
python jhcodec/inference.py \
    --config config/config_mimi_recon.json \
    --checkpoint jhcodec_mimi_1000000.pt \
    --input_file /path/to/input.wav \
    --output_file /path/to/output.wav \
    --num_codebooks 8 \
    --device 'cuda'
```

### Multiple Files
```bash
python jhcodec/decode_eval.py \
    --config config/config_dac_norecon.json \
    --checkpoint /path/to/checkpoint_300000.pt \
    --name jhcodec_dac_norecon \
    --glob_pattern "/path/to/audio/*.wav" \
    --out_dir "out_dir" \
    --hierarchy 4
```

Arguments:
- `--config`: Path to configuration file
- `--checkpoint`: Path to model checkpoint
- `--name`: Model name for the output directory
- `--glob_pattern`: Glob pattern for input audio files
- `--hierarchy`: Depth of the quantization hierarchy (default: 4)
- `--out_dir`: Output directory
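`--hierarchy` trades bitrate for quality: decoding with fewer codebooks lowers the bitrate. With `codebook_size: 1024`, each codebook contributes log2(1024) = 10 bits per frame. The sketch below works out the resulting bitrate; note that the 50 Hz frame rate is an assumed example value, not a figure stated in this README:

```python
import math

def bitrate_bps(num_codebooks: int, codebook_size: int = 1024,
                frame_rate_hz: float = 50.0) -> float:
    """Bits per second of an RVQ stream: codebooks x bits/codebook x frames/s.
    frame_rate_hz=50 is an assumed illustrative value, not from this README."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return bits_per_frame * frame_rate_hz

print(bitrate_bps(8))  # 8 codebooks -> 4000.0 bps at the assumed 50 Hz
print(bitrate_bps(4))  # --hierarchy 4 -> 2000.0 bps, half the bitrate
```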
The decoding script supports various audio datasets, e.g.:
- LibriSpeech: `/data/LibriSpeech/test-other/*/*/*.flac`
- TITW: `/data/titw/titw_hard/test/*.wav`
- MLS: `/data/MLS/mls_*/test/audio/*/*/*.flac`
## Configuration

Configuration files are JSON-based and include:
- Model Architecture: Encoder/decoder layers, attention heads, embedding dimensions
- Vector Quantization: Codebook size, number of codebooks, embedding dimensions
- Training: Learning rate, batch size, loss weights, discriminator settings
- Data: Sample rate, segment duration, data directories
- Logging: Checkpoint intervals, tensorboard settings
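To make the vector-quantization settings concrete: residual vector quantization encodes a vector with the first codebook, then encodes the leftover residual with the next codebook, and so on. Here is a toy scalar sketch of that idea; it is purely illustrative and unrelated to the actual implementation in `jhcodec/model/vq.py`, which operates on vectors:

```python
def rvq_encode(x: float, codebooks: list) -> tuple:
    """Toy residual quantization on scalars: each stage quantizes the residual
    left by the previous stage. Real RVQ works on embedding vectors."""
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]          # next stage only sees the leftover error
    return codes, x - residual       # code indices and the reconstruction

# Stage 1 picks 1.0 (residual -0.2); stage 2 picks -0.25 (residual 0.05).
codes, recon = rvq_encode(0.8, [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]])
```

Each additional codebook refines the reconstruction, which is why truncating the hierarchy (fewer codebooks at decode time) degrades quality gracefully.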
Example configuration structure:
{
"model": {
"encoder": {...},
"decoder": {...},
"rvq": {
"type": "dac",
"num_codebooks": 8,
"codebook_size": 1024
}
},
"training": {...},
"loss": {...},
"data": {...}
}Anonymous. Contact: jhcodec843@gmail.com Submitted to Interspeech 2026
## License

MIT License