# Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

JHCodec is a pure Transformer-decoder-based neural audio codec with residual vector quantization. It achieves state-of-the-art performance with minimal latency.

This repository contains the implementation for training and running inference with neural audio codecs. The codec supports:
- Multiple RVQ architectures (DAC, MIMI)
- End-to-end training leveraging (distilled) w2v-bert-2.0 semantic features
- SSRR and non-SSRR variants

## TODO
- Revise README
- Upload checkpoint
- Upload to HuggingFace
- Upload to PyPI (probably after the review)
- Make non-anonymous (after the review)
## Installation

```bash
pip install -e .
```

Requirements:
- Python >= 3.10: required for the `X | None` union type syntax in type hints (see PEP 604); remove this syntax manually if you must use an older Python version
- PyTorch/TorchAudio with CUDA support: tested with `torch==2.6.0+cu124` and `torch==2.9.1+cu128`
- `omegaconf==2.3.0`: for configuration management
- Flash-Attention: for fast training/inference; tested with `flash-attn==2.7.4.post1` and `flash-attn==2.8.3`
- HF transformers: required only for running baselines and w2v-bert-2.0; JHCodec inference has no dependency on it
We provide a shell script to help set up the environment. **Please do not run it directly.** Review the script and modify it as needed for your system; our model requires only the minimum dependencies listed above.
Note: Running on CPU currently leads to degraded reconstruction quality. If you discover the cause or a solution, please open an issue.
To install the remaining required libraries, run:

```bash
pip install omegaconf==2.3.0
pip install alias-free-torch==0.0.6 phaseaug
```
Flash-Attention should be installed carefully. Please read the official README.
## Pretrained Models

| Model | Description | Link |
|---|---|---|
| JHCodec | Streaming RVQ Codec, JHCodec-M (1M) | jhcodec/jhcodec |
| SW2V (60k) | Streaming Speech Representation Extractor | jhcodec/sw2v_60k |
| SW2V (120k) | Streaming Speech Representation Extractor, more robust to noise | jhcodec/sw2v_120k |
## Repository Structure

```
codec_paper/
├── jhcodec/                        # Main package
│   ├── model/                      # Model implementations
│   │   ├── codec.py                # Main codec models (JHCodec, JHCodecMimi)
│   │   ├── sw2v.py                 # Streaming wav2vec encoder
│   │   ├── discriminator.py        # Discriminator for adversarial training
│   │   └── vq.py                   # Vector quantization modules
│   ├── kernel/                     # Custom Triton kernels
│   │   ├── rotary_kernel.py        # Rotary positional embedding kernel, adapted from FlashAttention
│   │   └── vq_kernel.py            # Vector quantization kernel
│   ├── loss/                       # Custom loss functions
│   │   └── multiscalemelspec.py    # MultiScaleMelSpectrogramLoss for perceptual audio training
│   ├── train_codec_e2e_w2v.py      # End-to-end training script
│   ├── decode_eval.py              # Decoding and evaluation script
│   └── dataloader.py               # Data loading utilities
├── config/                         # Configuration files
│   ├── config_dac_norecon.json     # DAC without reconstruction
│   ├── config_dac_recon.json       # DAC with reconstruction
│   ├── config_mimi_norecon.json    # MIMI without reconstruction
│   └── config_mimi_recon.json      # MIMI with reconstruction
└── setup.py
```
## Training

### Data Preparation

The main block of `jhcodec/dataloader.py` demonstrates how to construct and inspect an `AudioDataset`:
```python
from jhcodec.dataloader import AudioDataset, collate_fn
from torch.utils.data import DataLoader

dataset = AudioDataset(
    audio_dir='./data',                   # Path to your data
    sample_rate=16000,
    segment_duration=10.24,
    training=True,
    init_dataset=False,                   # True: scan files initially (slow); False: load from cache
    cache_dir='cache_dir/dataloader/v9',  # Location of the cache
    use_mel=False,                        # Set True to also return Mel features
)
```

Notes:
- Initial dataset caching may take a while; once done, restart with `init_dataset=False` for faster loading.
- Requires all dependencies (see the top of `jhcodec/dataloader.py`).
- You can add a custom dataset by modifying the dictionary at the top of `dataloader.py`.
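For intuition, fixed-length training segments of `segment_duration` seconds at `sample_rate` Hz correspond to `int(16000 * 10.24) = 163840` samples per example. Here is a minimal, illustrative sketch of such cropping in plain Python; the real logic lives in `jhcodec/dataloader.py` and operates on tensors, so treat this only as a conceptual stand-in:

```python
import random

def crop_segment(audio: list, sample_rate: int = 16000,
                 segment_duration: float = 10.24, training: bool = True) -> list:
    """Illustrative fixed-length segmenting, not the actual dataset code."""
    seg_len = int(sample_rate * segment_duration)  # 163840 samples
    if not training or len(audio) <= seg_len:
        # Evaluation mode (or short clips): zero-pad up to the segment length.
        return audio + [0.0] * max(0, seg_len - len(audio))
    start = random.randrange(len(audio) - seg_len + 1)  # random crop offset
    return audio[start:start + seg_len]
```

A random crop per epoch gives the model varied views of long recordings while keeping batch shapes uniform.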
### Training Commands

```bash
python jhcodec/train_codec_e2e_w2v.py \
    --experiment_name paper/dac_recon \
    --config config/config_dac_recon.json \
    --resume  # only when resuming
```

```bash
python jhcodec/train_codec_e2e_w2v.py \
    --experiment_name paper/mimi_recon \
    --config config/config_mimi_recon.json \
    --resume
```

Available configurations:
- `config_dac_norecon.json` - DAC without reconstruction
- `config_dac_recon.json` - DAC with reconstruction
- `config_mimi_norecon.json` - MIMI without reconstruction
- `config_mimi_recon.json` - MIMI with reconstruction (main)
Key training parameters (configurable in the JSON config files):
- `learning_rate`: 1e-4
- `batch_size`: 42
- `num_epochs`: 100
- `warmup_steps`: 1000
- `discriminator_start_steps`: 10000
- Loss weights for reconstruction, VQ, commitment, feature matching, and adversarial losses
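The `warmup_steps` and `learning_rate` values above imply a warmup schedule. As a hedged sketch only (the actual schedule is defined in `train_codec_e2e_w2v.py` and may differ, e.g. by adding decay), a standard linear warmup looks like:

```python
def warmup_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold.
    Illustrative only -- check the training script for the real schedule."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# e.g. warmup_lr(500) is half of base_lr; after step 1000 it holds at 1e-4.
```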
## Inference

### Single File

```bash
python jhcodec/inference.py \
    --config config/config_mimi_recon.json \
    --checkpoint jhcodec_mimi_1000000.pt \
    --input_file /path/to/input.wav \
    --output_file /path/to/output.wav \
    --num_codebooks 8 \
    --device 'cuda'
```

### Multiple Files
```bash
python jhcodec/decode_eval.py \
    --config config/config_dac_norecon.json \
    --checkpoint /path/to/checkpoint_300000.pt \
    --name jhcodec_dac_norecon \
    --glob_pattern "/path/to/audio/*.wav" \
    --out_dir "out_dir" \
    --hierarchy 4
```

Arguments:
- `--config`: Path to configuration file
- `--checkpoint`: Path to model checkpoint
- `--name`: Model name for the output directory
- `--glob_pattern`: Glob pattern for input audio files
- `--hierarchy`: Depth of the quantization hierarchy (default: 4)
- `--out_dir`: Output directory
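`--hierarchy` trades bitrate for quality: decoding with fewer codebooks lowers the bitrate. With `codebook_size: 1024`, each codebook contributes log2(1024) = 10 bits per frame. The sketch below works out the resulting bitrate; note that the 50 Hz frame rate is an assumed example value, not a figure stated in this README:

```python
import math

def bitrate_bps(num_codebooks: int, codebook_size: int = 1024,
                frame_rate_hz: float = 50.0) -> float:
    """Bits per second of an RVQ stream: codebooks x bits/codebook x frames/s.
    frame_rate_hz=50 is an assumed illustrative value, not from this README."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return bits_per_frame * frame_rate_hz

print(bitrate_bps(8))  # 8 codebooks -> 4000.0 bps at the assumed 50 Hz
print(bitrate_bps(4))  # --hierarchy 4 -> 2000.0 bps, half the bitrate
```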
The decoding script supports various audio datasets, e.g.:
- LibriSpeech: `/data/LibriSpeech/test-other/*/*/*.flac`
- TITW: `/data/titw/titw_hard/test/*.wav`
- MLS: `/data/MLS/mls_*/test/audio/*/*/*.flac`
## Configuration

Configuration files are JSON-based and include:
- Model Architecture: Encoder/decoder layers, attention heads, embedding dimensions
- Vector Quantization: Codebook size, number of codebooks, embedding dimensions
- Training: Learning rate, batch size, loss weights, discriminator settings
- Data: Sample rate, segment duration, data directories
- Logging: Checkpoint intervals, tensorboard settings
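To make the vector-quantization settings concrete: residual vector quantization encodes a vector with the first codebook, then encodes the leftover residual with the next codebook, and so on. Here is a toy scalar sketch of that idea; it is purely illustrative and unrelated to the actual implementation in `jhcodec/model/vq.py`, which operates on vectors:

```python
def rvq_encode(x: float, codebooks: list) -> tuple:
    """Toy residual quantization on scalars: each stage quantizes the residual
    left by the previous stage. Real RVQ works on embedding vectors."""
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]          # next stage only sees the leftover error
    return codes, x - residual       # code indices and the reconstruction

# Stage 1 picks 1.0 (residual -0.2); stage 2 picks -0.25 (residual 0.05).
codes, recon = rvq_encode(0.8, [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]])
```

Each additional codebook refines the reconstruction, which is why truncating the hierarchy (fewer codebooks at decode time) degrades quality gracefully.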
Example configuration structure:
{
"model": {
"encoder": {...},
"decoder": {...},
"rvq": {
"type": "dac",
"num_codebooks": 8,
"codebook_size": 1024
}
},
"training": {...},
"loss": {...},
"data": {...}
}Anonymous. Contact: jhcodec843@gmail.com Submitted to Interspeech 2026
## License

MIT License