# pyJarvis

A Python implementation of the Jarvis text-to-voice assistant with an animated digital face, LLM integration, and speech recognition.

It is a modular Python application following Clean Architecture, SOLID principles, and common design patterns. Modules:
- pyjarvis_shared: Shared types, messages, and centralized configuration
- pyjarvis_core: Core domain logic (text analysis, TTS processors, audio processing, animations)
- pyjarvis_service: Service layer for text-audio processing (TCP IPC)
- pyjarvis_cli: CLI application to send text to service
- pyjarvis_ui: Desktop UI with animated robot face (Pygame)
- pyjarvis_llama: LLM integration with speech recognition (Ollama + RealtimeSTT)
## Prerequisites

- Python 3.10+
- FFmpeg (for MP3 processing) — see the "Install FFmpeg" section below
- Optional: Ollama (for LLM features)
- See `requirements.txt` for all dependencies
## Quick Start

Install dependencies:

```bash
pip install -r requirements.txt
```

Then run each component in its own terminal:

- Terminal 1, start the service: `python -m pyjarvis_service`
- Terminal 2, start the UI: `python -m pyjarvis_ui`
- Terminal 3, send text: `python -m pyjarvis_cli "Hello, I am Jarvis"`

You can also specify the language manually:

```bash
python -m pyjarvis_cli "Hola, soy Jarvis" --language es
python -m pyjarvis_cli "Olá, eu sou Jarvis" --language pt-BR
```

## LLM CLI (Ollama)

- Start Ollama (if using a local LLM): `ollama serve`
- Start the service and UI as above.
- Start the LLM CLI in another terminal: `python -m pyjarvis_llama`

In the LLM CLI you can:
- Type messages and press Enter for text input
- Use `/m` to record audio from the microphone (press Enter to stop)
- Use `/lang <code>` to change the speech-recognition (STT) language
- Use `/persona <name>` to change the AI persona
- Automatic language detection: the system detects the language of LLM responses and uses the matching TTS voice
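Under the hood, the LLM CLI talks to a local Ollama server over HTTP. A minimal, non-streaming client can be sketched with only the standard library; the endpoint and payload follow Ollama's documented `/api/generate` route, but the function names here are illustrative, not the project's actual code:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default base URL

def build_generate_request(model: str, prompt: str) -> bytes:
    """Build a non-streaming JSON request body for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

def ask_ollama(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """POST the prompt to Ollama and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the full reply in "response".
        return json.loads(resp.read())["response"]
```

For example, `ask_ollama("llama3", "Say hello in one sentence.")` would return the model's reply, provided Ollama is running and that model has been pulled.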
The UI can run standalone but is much more useful connected to the service:
```bash
python -m pyjarvis_ui
```

## Project Structure

Top-level layout (canonical):

```
pyJarvis/
├── README.md
├── requirements.txt
├── pyjarvis_shared/
├── pyjarvis_core/
├── pyjarvis_service/
├── pyjarvis_cli/
├── pyjarvis_ui/
├── pyjarvis_llama/
├── audio/
├── assets/
├── models/
└── docs/
```

- `pyjarvis_shared/`: AppConfig, message types, and shared utilities
- `pyjarvis_core/`: TextAnalyzer, AnimationController, TTS factory and processors
- `pyjarvis_service/`: IPC server and TextProcessor orchestration
- `pyjarvis_ui/`: Pygame-based UI (FaceRenderer, AudioPlayer, ServiceClient)
- `pyjarvis_llama/`: LLM CLI, Ollama client, STT recorder, personas
## Features

- Animated robot face with lip-sync and emotion-driven effects
- Multiple TTS engines (Edge-TTS default, gTTS available)
- Automatic language detection for TTS (Portuguese, English, Spanish) using `langdetect` with a heuristic fallback
- STT integration via RealtimeSTT / Whisper models
- LLM support via Ollama and configurable AI personas
- Multi-language support: Automatic detection and manual override for TTS language selection
- TCP-based IPC; UI registers for broadcast updates from service
## Configuration

All runtime configuration is centralized in `pyjarvis_shared/config.py` (`AppConfig`):

- TTS processor selection (`tts_processor`)
- Audio output directory and auto-delete behavior
- Edge-TTS voice mapping (`edge_tts_voices`)
- STT model and language
- Ollama base URL and model
- Language detection: the `langdetect` library with a heuristic fallback
## Language Detection

PyJarvis includes automatic language detection for TTS voice selection:

- Primary method: the `langdetect` library for accurate language detection
- Fallback: heuristic detection based on language-specific patterns and keywords
- Supported languages: Portuguese (pt-BR), English (en-US), Spanish (es-ES, es-MX, es-AR, etc.)
- Manual override: specify the language explicitly with the `--language` flag in the CLI
The system automatically detects the language of:
- Text sent via CLI (if no language is specified)
- LLM responses in the interactive CLI (detected automatically and sent to TTS with the correct language code)

Accepted language codes:

- Portuguese: `pt`, `pt-BR`, `portuguese`
- English: `en`, `en-US`, `en-GB`, `english`
- Spanish: `es`, `es-ES`, `es-MX`, `es-AR`, `spanish`, `español`
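To make the two-stage scheme concrete, here is a small sketch of alias normalization plus a keyword-scoring fallback. The alias table mirrors the codes listed above; the keyword lists and scoring are illustrative only, since the real implementation tries `langdetect` first:

```python
import re

# Alias table mirroring the accepted codes listed above.
LANGUAGE_ALIASES = {
    "pt": "pt-BR", "pt-br": "pt-BR", "portuguese": "pt-BR",
    "en": "en-US", "en-us": "en-US", "en-gb": "en-GB", "english": "en-US",
    "es": "es-ES", "es-es": "es-ES", "es-mx": "es-MX", "es-ar": "es-AR",
    "spanish": "es-ES", "español": "es-ES",
}

def normalize_language(code: str) -> str:
    """Map a user-supplied language alias to a canonical TTS code."""
    return LANGUAGE_ALIASES.get(code.strip().lower(), code)

def detect_language_heuristic(text: str) -> str:
    """Keyword-based fallback detection (illustrative keyword lists)."""
    # Strip punctuation so "Olá," still matches the keyword " olá ".
    padded = " " + re.sub(r"[^\w\s]", " ", text.lower()) + " "
    keywords = {
        "pt-BR": (" não ", " você ", " olá ", " obrigado ", " eu sou "),
        "es-ES": (" hola ", " gracias ", " soy ", " usted ", " está "),
        "en-US": (" hello ", " the ", " i am ", " you ", " thanks "),
    }
    scores = {lang: sum(k in padded for k in words)
              for lang, words in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "en-US"  # default to English
```

For instance, `detect_language_heuristic("Olá, eu sou Jarvis")` scores Portuguese highest and returns `"pt-BR"`.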
## TTS Engines

- Edge-TTS (Microsoft) is the default high-quality engine. Voice mapping is configurable by language.
- gTTS (Google) is supported as a fallback (requires internet and FFmpeg for MP3→WAV conversion).
- Automatic language detection ensures the correct voice is used for each language.
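The engine choice sits behind the TTS factory in `pyjarvis_core`. Conceptually it can be sketched like this; the class and function names are illustrative, and the `synthesize` bodies are stubbed out rather than calling the real engines:

```python
from abc import ABC, abstractmethod

class TTSProcessor(ABC):
    """Common interface every TTS engine implements."""

    @abstractmethod
    def synthesize(self, text: str, voice: str) -> bytes:
        """Return synthesized audio bytes for the given text and voice."""

class EdgeTTSProcessor(TTSProcessor):
    def synthesize(self, text: str, voice: str) -> bytes:
        raise NotImplementedError("would call edge-tts here")

class GTTSProcessor(TTSProcessor):
    def synthesize(self, text: str, voice: str) -> bytes:
        raise NotImplementedError("would call gTTS, then FFmpeg for MP3 to WAV")

# Registry keyed by the (hypothetical) AppConfig.tts_processor value.
_REGISTRY = {"edge-tts": EdgeTTSProcessor, "gtts": GTTSProcessor}

def create_tts_processor(name: str) -> TTSProcessor:
    """Instantiate the processor selected by configuration."""
    try:
        return _REGISTRY[name.lower()]()
    except KeyError:
        raise ValueError(f"unknown TTS processor: {name!r}") from None
```

Keeping engines behind one interface is what lets the service swap Edge-TTS for gTTS (or a WAV-emitting engine) with a single config change.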
## Install FFmpeg

PyJarvis uses FFmpeg (via pydub) to convert MP3 audio (e.g., produced by gTTS) to WAV.
Windows options:
- Chocolatey (recommended): `choco install ffmpeg`
- winget: `winget install ffmpeg`
- Manual download:
  - Download from: https://www.gyan.dev/ffmpeg/builds/
  - Extract the ZIP and add the `bin` folder to your PATH (e.g. `C:\ffmpeg\bin`).
  - Restart your terminal.

Verify the installation:

```bash
ffmpeg -version
```

If you prefer not to install FFmpeg, use a TTS engine that emits WAV directly.
## Voice Configuration

Configure voices in `pyjarvis_shared/config.py` (example):

```python
from pyjarvis_shared import AppConfig

config = AppConfig()
config.edge_tts_voices = {
    "pt-br": "pt-BR-HumbertoNeural",
    "pt": "pt-BR-FranciscaNeural",
    "en": "en-US-AriaNeural",
    "en-us": "en-US-GuyNeural",
    "es": "es-ES-ElviraNeural",
    "es-es": "es-ES-ElviraNeural",
    "es-mx": "es-MX-DaliaNeural",
    "es-ar": "es-AR-ElenaNeural"
}
```

### Portuguese (pt-BR) voices

- `pt-BR-FranciscaNeural` - female (default)
- `pt-BR-HumbertoNeural` - male
- `pt-BR-AntonioNeural` - male
- `pt-BR-BrendaNeural` - female
- `pt-BR-DonatoNeural` - male
- `pt-BR-ElzaNeural` - female
- `pt-BR-FabioNeural` - male
- `pt-BR-GiovannaNeural` - female
- `pt-BR-JulioNeural` - male
- `pt-BR-LeilaNeural` - female
- `pt-BR-LeticiaNeural` - female
- `pt-BR-ManuelaNeural` - female
- `pt-BR-NicolauNeural` - male
- `pt-BR-ThalitaNeural` - female
- `pt-BR-ValerioNeural` - male
- `pt-BR-YaraNeural` - female
### English (en-US) voices

- `en-US-AriaNeural` - female (default)
- `en-US-GuyNeural` - male
- `en-US-JennyNeural` - female
- `en-US-AmberNeural` - female
- `en-US-AnaNeural` - female (child)
- `en-US-AshleyNeural` - female
- `en-US-BrandonNeural` - male
- `en-US-ChristopherNeural` - male
- `en-US-CoraNeural` - female
- `en-US-ElizabethNeural` - female
- `en-US-EricNeural` - male
- `en-US-JacobNeural` - male
- `en-US-JaneNeural` - female
- `en-US-JasonNeural` - male
- `en-US-MichelleNeural` - female
- `en-US-MonicaNeural` - female
- `en-US-NancyNeural` - female
- `en-US-RogerNeural` - male
- `en-US-SaraNeural` - female
- `en-US-TonyNeural` - male

### Spanish (es-ES) voices

- `es-ES-ElviraNeural` - female (default; Bright, Clear)
- `es-ES-AlvaroNeural` - male (Confident, Animated)
- `es-ES-AbrilNeural` - female
- `es-ES-ArabellaMultilingualNeural` - female (Cheerful, Friendly, Casual, Warm, Pleasant)
- `es-ES-ArnauNeural` - male
- `es-ES-DarioNeural` - male
- `es-ES-EliasNeural` - male
- `es-ES-EstrellaNeural` - female
- `es-ES-IreneNeural` - female (Curious, Cheerful)
- `es-ES-IsidoraMultilingualNeural` - female (Cheerful, Friendly, Warm, Casual)
- `es-ES-LaiaNeural` - female
- `es-ES-LiaNeural` - female (Animated, Bright)
- `es-ES-NilNeural` - male
- `es-ES-SaulNeural` - male
- `es-ES-TeoNeural` - male
- `es-ES-TrianaNeural` - female
- `es-ES-VeraNeural` - female
- `es-ES-XimenaNeural` - female

### Spanish (es-MX) voices

- `es-MX-DaliaNeural` - female (default)
- `es-MX-DaliaMultilingualNeural` - female
- `es-MX-BeatrizNeural` - female
- `es-MX-CandelaNeural` - female
- `es-MX-CarlotaNeural` - female
- `es-MX-CecilioNeural` - male
- `es-MX-GerardoNeural` - male
- `es-MX-JorgeNeural` - male
- `es-MX-JorgeMultilingualNeural` - male
- `es-MX-LarissaNeural` - female
- `es-MX-LibertoNeural` - male
- `es-MX-LucianoNeural` - male
- `es-MX-MarinaNeural` - female
- `es-MX-NuriaNeural` - female
- `es-MX-PelayoNeural` - male
- `es-MX-RenataNeural` - female
- `es-MX-YagoNeural` - male

### Spanish (es-AR) voices

- `es-AR-ElenaNeural` - female (Bright, Clear)
- `es-AR-TomasNeural` - male

### Spanish (es-CO) voices

- `es-CO-SalomeNeural` - female
- `es-CO-GonzaloNeural` - male

### Other Spanish regional voices

- `es-CL-CatalinaNeural` - female (Chile)
- `es-CL-LorenzoNeural` - male (Chile)
- `es-PE-AlexNeural` - male (Peru)
- `es-PE-CamilaNeural` - female (Peru)
- `es-US-AlonsoNeural` - male (US Spanish)
- `es-US-PalomaNeural` - female (US Spanish)
- `es-UY-MateoNeural` - male (Uruguay)
- `es-UY-ValentinaNeural` - female (Uruguay)
- `es-VE-PaolaNeural` - female (Venezuela)
- `es-VE-SebastianNeural` - male (Venezuela)
For a complete list of all available Spanish voices, run:
```bash
edge-tts --list-voices | grep "^es-"
```

## Testing

Quick test flow:
- Start the service: `python -m pyjarvis_service` (should listen on 127.0.0.1:8888)
- Start the UI: `python -m pyjarvis_ui` (a window should open and attempt to connect)
- Send text via the CLI: `python -m pyjarvis_cli "Hello, I am Jarvis"`
Expected: service processes text, generates an audio file in ./audio/, broadcasts a VoiceProcessingUpdate, and the UI plays the audio while animating the face.
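A quick programmatic smoke check for the first step (is the service port accepting connections?) could look like this; the helper name is illustrative, and 8888 is the service's default port:

```python
import socket

def service_reachable(host: str = "127.0.0.1", port: int = 8888,
                      timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the service can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Running `service_reachable()` before starting the UI or CLI gives a fast yes/no answer without parsing service logs.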
Testing checklist (manual):
- Service starts without errors
- UI connects to service
- CLI can send text
- Audio is generated and played
- Robot face animates during speech
- Audio files are cleaned up (if configured)
- LLM CLI connects to Ollama (if used)
- Speech recognition works (`/m` in the LLM CLI)
- Language detection works correctly (test with Portuguese, English, and Spanish text)
- Manual language override works (`--language` flag in the CLI)


