vits_chinese is an implementation of the VITS end-to-end text-to-speech (TTS) architecture tailored for Chinese (and potentially multilingual) speech synthesis. VITS combines a variational autoencoder (VAE), normalizing flows, adversarial training, and a stochastic duration predictor, a design that enables natural, expressive speech with realistic variation in rhythm and prosody.

Porting VITS to Chinese targets a language whose tones, pronunciation variability, and prosody make high-quality synthesis particularly challenging. The repository offers the full training and inference pipeline: text preprocessing, mel-spectrogram generation, training scripts, and audio synthesis. Users who do not want to train their own models can start from pre-trained checkpoints (or the instructions for obtaining them). Because VITS is end-to-end, its HiFi-GAN-style decoder produces waveforms directly, so no separate vocoder needs to be bolted on at synthesis time.
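The stochastic duration predictor mentioned above is what gives VITS its varied rhythm: phoneme durations are sampled from a noise-driven distribution rather than predicted deterministically. The toy sketch below (illustrative only, not the repository's code; all names are made up) mimics that idea by sampling per-phoneme frame counts from log-domain Gaussian noise and expanding the sequence, so two runs on the same text yield different timing.

```python
import math
import random


def sample_durations(phonemes, rng, mean_frames=6.0, spread=0.35):
    """Sample a per-phoneme frame count from log-domain Gaussian noise,
    a toy stand-in for the flow-based duration predictor in VITS."""
    durations = []
    for _ in phonemes:
        # Noise in the log domain keeps durations positive and varied.
        log_d = math.log(mean_frames) + rng.gauss(0.0, spread)
        durations.append(max(1, round(math.exp(log_d))))
    return durations


def length_regulate(phonemes, durations):
    """Expand each phoneme to its sampled number of frames."""
    frames = []
    for ph, d in zip(phonemes, durations):
        frames.extend([ph] * d)
    return frames


phonemes = ["n", "i3", "h", "ao3"]  # toy tonal phoneme sequence
rng_a, rng_b = random.Random(1), random.Random(2)
frames_a = length_regulate(phonemes, sample_durations(phonemes, rng_a))
frames_b = length_regulate(phonemes, sample_durations(phonemes, rng_b))
# Different noise draws give different rhythms for the same text.
```

In the real model the sampled durations align text-side hidden states with spectrogram frames; the point here is only that stochastic sampling, not a fixed prediction, drives the prosodic variation.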
## Features
- VITS end-to-end TTS architecture tuned for Chinese, with support for tones and prosody
- Non-autoregressive, parallel audio synthesis for fast generation
- Full pipeline including preprocessing, spectrogram generation, training and inference scripts
- Support for training on both single-speaker (e.g. LJSpeech-style) and multi-speaker datasets (with adaptation)
- Pretrained models or checkpoint compatibility for immediate inference without training from scratch
- Natural rhythm, tonal variation, and expressive delivery in Chinese, thanks to the stochastic duration predictor