VoiceCast turns a short audio sample into a voice you can use for text-to-speech — in 16 languages,
with expressive emotions, through a desktop app, command line, or Python API.
- You're building an audiobook, a game, or a prototype — and the voice matters, but hiring voice talent for every iteration is slow and expensive.
- You need multilingual narration but can't find a single voice that sounds natural across languages.
- Existing TTS tools produce robotic, flat output that doesn't match the expressiveness you need — no laughs, no sighs, no personality.
VoiceCast solves all three. Record 5–30 seconds of any voice, and generate natural, expressive speech in that voice — instantly, locally, for free.
- Any voice, cloned in seconds — Feed in a 5–30 second WAV sample and VoiceCast learns the voice. No training, no cloud upload, no waiting.
- 16 languages, one tool — English, Spanish, French, German, Chinese, Japanese, and 10 more. Switch languages without switching voices.
- Expressive speech that sounds human — Add `[laugh]`, `[sigh]`, `[gasp]`, and more with Chatterbox Turbo. Your cloned voice doesn't just talk — it performs.
- Three ways to use it — A polished desktop GUI for quick tasks, a CLI for automation, and a Python API for integration into your own projects.
- Runs on your machine — No API keys, no cloud dependencies, no per-word billing. Your voice data stays local.
- Install — Clone the repo, create a virtual environment, and `pip install -e .` — that's it.
- Pick a voice sample — Any clean 5–30 second audio clip of the voice you want to clone.
- Choose your engine — Coqui XTTS v2 for multilingual quality, or Chatterbox for speed and expressiveness.
- Generate speech — Type your text, hit generate, and get a WAV file in the cloned voice.
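A sample outside the 5–30 second window is the most common cause of poor clones, so it can help to check the clip before generating. A minimal sketch using only the standard-library `wave` module (the file name and the synthetic demo clip are illustrative, not part of VoiceCast):

```python
import math
import struct
import wave

def sample_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def is_usable_sample(path, min_s=5.0, max_s=30.0):
    """VoiceCast expects a clean 5-30 second speech clip."""
    return min_s <= sample_duration_seconds(path) <= max_s

# Write a 10-second 16 kHz mono sine tone as a stand-in for a real recording
with wave.open("demo_sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 220 * t / 16000)))
        for t in range(16000 * 10)
    ))

print(is_usable_sample("demo_sample.wav"))  # True
```

Anything that reads the WAV header works here; the point is to fail fast on clips that are too short or too long.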
| Engine | Languages | Speed | Best For |
|---|---|---|---|
| Coqui XTTS v2 | 16 | Medium | Multilingual narration, production quality |
| Chatterbox Turbo | English | Fast | Rapid iteration, expressive speech with emotion tags |
| Chatterbox Standard | English | Medium | High-fidelity English output |
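The table above reads as a simple decision rule. A small illustrative helper that encodes it (the engine name strings are placeholders chosen for this sketch, not VoiceCast identifiers):

```python
def pick_engine(language="en", need_emotion_tags=False, prioritize_speed=False):
    """Map the engine-comparison table to a choice.

    Engine names are illustrative labels, not VoiceCast API values.
    """
    if language != "en":
        # Coqui XTTS v2 is the only multilingual option
        return "coqui_xtts_v2"
    if need_emotion_tags or prioritize_speed:
        # Chatterbox Turbo: fast iteration and [laugh]/[sigh]-style tags
        return "chatterbox_turbo"
    # Chatterbox Standard: high-fidelity English output
    return "chatterbox_standard"
```

For example, French narration always lands on Coqui, while English text with emotion tags lands on Chatterbox Turbo.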
```bash
# Clone the repository
git clone https://github.com/luongnv89/voice-cast.git
cd voice-cast

# Create virtual environment
python3.10 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install
pip install -e .
```

Launch the GUI:

```bash
python voice_cloning_app.py
```

Or use the CLI:

```bash
python vcloner.py -i voice.wav -t "Hello world" -o output.wav
```

Or call the Python API:

```python
from voice_cloner import VoiceCloner

cloner = VoiceCloner(speaker_wav="./voice-samples/speaker.wav")
cloner.say("Hello, this is my cloned voice!", save_audio=True, output_file="output.wav")
```

Add expressive speech with Chatterbox Turbo:

```python
cloner.say("That's hilarious [laugh]! I can't believe it [gasp]!")
```

Supported tags: `[laugh]`, `[chuckle]`, `[cough]`, `[sigh]`, `[gasp]`, `[yawn]`
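For longer scripts such as audiobook chapters, one common pattern is to split the text at sentence boundaries and synthesize each piece separately. A sketch of that pattern (the 400-character limit is an assumption for this example, not a documented VoiceCast constraint; the `cloner.say` loop is shown as a comment so the helper stays self-contained):

```python
import re

def chunk_text(text, max_chars=400):
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chapter = "It was a dark night. The rain would not stop. Somewhere, a door slammed."
for i, chunk in enumerate(chunk_text(chapter, max_chars=40), 1):
    # Each chunk would then be fed to the cloned voice, e.g.:
    # cloner.say(chunk, save_audio=True, output_file=f"part_{i:03d}.wav")
    print(i, chunk)
```

Per-sentence chunks also make it cheap to regenerate a single flubbed line instead of a whole chapter.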
**Is VoiceCast free?** Yes. VoiceCast is MIT licensed — free for personal and commercial use, forever. See LICENSE.

**Does it need a GPU?** No. VoiceCast runs on CPU. An NVIDIA GPU with CUDA speeds up generation significantly, and Apple Silicon users can install the optional MLX backend for hardware acceleration.

**What are the system requirements?** Python 3.10+, 8GB RAM (16GB recommended). Optional: NVIDIA GPU with CUDA or Apple Silicon with MLX.

**How does Coqui compare to Chatterbox?** Coqui XTTS v2 supports 16 languages and produces high-quality multilingual output. Chatterbox is English-only but faster and supports expressive emotion tags. Use both — VoiceCast makes switching engines seamless.
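When moving tagged text from Chatterbox to Coqui, a reasonable precaution is to strip the emotion tags first, since only Chatterbox Turbo is described as interpreting them. A minimal sketch (that Coqui would otherwise read the bracketed tags aloud is an assumption; the tag list matches the six tags documented above):

```python
import re

# The six emotion tags listed in this README
EMOTION_TAGS = ("laugh", "chuckle", "cough", "sigh", "gasp", "yawn")
TAG_PATTERN = re.compile(r"\s*\[(?:%s)\]" % "|".join(EMOTION_TAGS))

def strip_emotion_tags(text):
    """Remove Chatterbox emotion tags so text can go to a tag-unaware engine."""
    return TAG_PATTERN.sub("", text)

print(strip_emotion_tags("That's hilarious [laugh]! I can't believe it [gasp]!"))
# → That's hilarious! I can't believe it!
```

Keeping one tagged master script and stripping tags per engine avoids maintaining two copies of the text.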
Is my voice data sent to the cloud? No. Everything runs locally on your machine. No API keys, no cloud uploads, no telemetry.
Can I use this in production? Yes. VoiceCast provides a Python API designed for integration. See the API Reference for details.
How long does the voice sample need to be? 5–30 seconds of clean speech. Longer samples can improve quality, but even 5 seconds produces usable results.
VoiceCast puts voice cloning in your hands — no cloud, no cost, no restrictions. Clone voices for audiobooks, games, accessibility tools, creative projects, or anything else you can imagine.
MIT licensed. Runs locally. Works on Linux, macOS, and Windows.
## Documentation
| Document | Description |
|---|---|
| API Reference | Complete Python API documentation |
| CLI Reference | Command-line interface guide |
| GUI Guide | Desktop application user manual |
| Engines Guide | TTS engine comparison and parameters |
| Architecture | System design and patterns |
| Development | Contributing and setup guide |
| Troubleshooting | Common issues and solutions |
## System Requirements
- Python 3.10+
- 8GB RAM (16GB recommended)
- NVIDIA GPU with CUDA (optional, for faster processing)
- Apple Silicon with MLX (optional, for hardware acceleration on Mac)
Optional: Install Chatterbox Engine

```bash
pip install -e ".[chatterbox]"
```

Optional: Install MLX Backend (Apple Silicon)

```bash
pip install -e ".[mlx]"
```

- Coqui TTS — XTTS v2 model
- Chatterbox — Fast TTS by Resemble AI
- PyTorch — Deep learning framework
