A multi-speaker speech recognition system using NVIDIA's Multitalker Parakeet Streaming model with speaker diarization.
- Multi-speaker transcription: Identifies and separates up to 4 speakers
- Speaker diarization: Automatic speaker identification and labeling
- Word-level timestamps: Precise timing for each word
- Multiple output formats:
  - Turn-based with overlap markers (best for conversations/debates)
  - Colored word-by-word interleaving
  - Speaker-tagged segments
  - Detailed word timestamps
- Automatic audio preprocessing: Converts stereo to mono, resamples to 16kHz
- Streaming processing: Efficient chunk-based processing
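As a rough illustration of the chunk-based streaming idea (the actual logic lives in `transcribe.py` and NeMo's streaming APIs; `iter_chunks` below is a hypothetical helper, not part of this repo):

```python
import numpy as np

def iter_chunks(waveform: np.ndarray, sample_rate: int = 16000,
                chunk_seconds: float = 2.0):
    """Yield consecutive fixed-length chunks of a mono waveform."""
    step = int(chunk_seconds * sample_rate)
    for start in range(0, len(waveform), step):
        yield waveform[start:start + step]
```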
- Hardware: NVIDIA GPU with CUDA support (tested on RTX 4090)
- CUDA: Version 12.x
- Python: 3.10+
- Clone or navigate to this directory:

  ```bash
  cd multitalkparakeet
  ```

- Create and activate a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Verify the installation:

  ```bash
  python verify_setup.py
  ```
Use the wrapper script for proper CUDA library handling:
```bash
# Basic multi-speaker transcription
./run_transcribe.sh --audio your_audio.wav

# With quiet mode (suppresses verbose logging)
./run_transcribe.sh --audio your_audio.wav -q
```

For a turn-based transcript with overlap markers:

```bash
./run_transcribe.sh --audio your_audio.wav --turns -q
```

Output:

```
[0:01.35] SPK 0: Ukraine are now they're heavy-handed approach.
└─[SPK 1 overlapping]: They're heavy-handed approach.
[0:04.58] SPK 0: They're heavy-handed approach.
└─[SPK 1 overlapping]: You both have said Vladimir Putin...
```
```bash
./run_transcribe.sh --audio your_audio.wav --color -q
```

Shows each word in chronological order with speaker colors.
```bash
./run_transcribe.sh --audio your_audio.wav --words -q
```

Output:

```
1.35s - 1.89s : Ukraine [speaker_0]
1.89s - 2.43s : are [speaker_0]
3.67s - 4.20s : They're [speaker_1]
```
For faster processing without diarization:
```bash
./run_transcribe.sh --audio your_audio.wav --simple --words -q
```

| Option | Description |
|---|---|
| `--audio`, `-a` | Path to audio file (required) |
| `--output`, `-o` | Output JSON file path (default: `output_transcription.json`) |
| `--turns` | Show turn-based transcript with overlap markers |
| `--color` | Show colored interleaved word-by-word output |
| `--words` | Show word-level timestamps |
| `--simple` | Use single-speaker mode (faster, no diarization) |
| `--quiet`, `-q` | Suppress verbose NeMo logging |
| `--cpu` | Force CPU usage (not recommended; very slow) |
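The `--output` option writes the full transcript to JSON. The schema is whatever `transcribe.py` emits, so inspect it before relying on particular keys; a minimal sketch:

```python
import json

# Load the JSON written by a previous run (default output path).
with open("output_transcription.json") as f:
    result = json.load(f)

# Check the top-level structure before assuming specific fields.
if isinstance(result, dict):
    print("top-level keys:", sorted(result))
else:
    print("entries:", len(result))
```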
```bash
# Turn-based + word timestamps
./run_transcribe.sh --audio your_audio.wav --turns --words -q

# All visual outputs
./run_transcribe.sh --audio your_audio.wav --turns --color --words -q
```

- Supported formats: WAV, MP3, FLAC, and other common formats
- Automatic preprocessing:
  - Stereo files are converted to mono
  - Audio is resampled to 16kHz if needed
- Recommended: 16kHz mono WAV for best performance
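This preprocessing happens automatically, but if you prefer to prepare files yourself, it amounts to something like the following sketch (assumes `librosa` and `soundfile` are installed; they may not be listed in `requirements.txt`):

```python
import librosa
import soundfile as sf

TARGET_SR = 16000  # the sample rate the models expect

# librosa downmixes to mono and resamples in a single call.
audio, _ = librosa.load("your_audio.wav", sr=TARGET_SR, mono=True)

# Write the recommended input format: 16kHz mono WAV.
sf.write("your_audio_16k.wav", audio, TARGET_SR)
```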
This system uses two NVIDIA NeMo models:
- Speaker Diarization: `nvidia/diar_streaming_sortformer_4spk-v2.1`
  - Identifies up to 4 speakers
  - Streaming-capable for real-time processing
- ASR (Speech Recognition): `nvidia/multitalker-parakeet-streaming-0.6b-v1`
  - 600M-parameter RNNT model
  - Optimized for multi-speaker scenarios
  - Streaming-capable
Models are automatically downloaded from HuggingFace on first run (~2GB total).
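To avoid the download pause on a first run, you can pre-fetch both model repos into the local HuggingFace cache; a sketch assuming `huggingface_hub` is available in the environment (it is typically pulled in as a NeMo dependency):

```python
from huggingface_hub import snapshot_download

# Pre-fetch both model repos so the first transcription run
# reads from the cache instead of downloading ~2GB.
for repo in (
    "nvidia/diar_streaming_sortformer_4spk-v2.1",
    "nvidia/multitalker-parakeet-streaming-0.6b-v1",
):
    snapshot_download(repo)
```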
```
multitalkparakeet/
├── README.md               # This file
├── requirements.txt        # Python dependencies
├── run_transcribe.sh       # Main wrapper script (handles CUDA paths)
├── transcribe.py           # Core transcription logic
├── verify_setup.py         # Installation verification
├── generate_test_audio.py  # Generate synthetic test audio
└── venv/                   # Python virtual environment
```
If you see errors about cuDNN version incompatibility, the wrapper script `run_transcribe.sh` handles this by setting the correct library paths. Always use the wrapper script instead of running `transcribe.py` directly.
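The usual fix of this kind, and roughly what such a wrapper does (an assumption for illustration; check `run_transcribe.sh` for the actual paths it sets), is to put the venv's pip-installed cuDNN ahead of any system copy:

```bash
# Assumption: the venv includes the nvidia-cudnn-cu12 pip package.
# Put its lib/ directory first on the dynamic loader path.
CUDNN_LIB=$(python -c "import os, nvidia.cudnn; print(os.path.join(os.path.dirname(nvidia.cudnn.__file__), 'lib'))")
export LD_LIBRARY_PATH="${CUDNN_LIB}:${LD_LIBRARY_PATH}"
```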
The models require approximately 4-6GB of GPU memory. If you encounter OOM errors:
- Close other GPU-intensive applications
- Try processing shorter audio files
- Use `--simple` mode for reduced memory usage
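A quick way to check GPU memory headroom before a run:

```bash
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```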
The first run downloads models from HuggingFace (~2GB). Subsequent runs use cached models and start much faster.
- Speaker labels (`speaker_0`, `speaker_1`, etc.) are assigned automatically
- Labels are consistent within a single audio file
- The model supports up to 4 simultaneous speakers
In `--turns` mode:

- `└─[SPK X overlapping]:` indicates speech that overlaps with the current turn
- Multiple overlapping speakers are shown in chronological order
- Timestamps are in seconds from the start of the audio
- Word-level timestamps have ~80ms resolution
- Segment timestamps show the full duration of each speaker's contribution
This project uses NVIDIA NeMo models which are subject to NVIDIA's licensing terms. See NVIDIA NeMo for details.
- NVIDIA NeMo team for the ASR and diarization models
- HuggingFace for model hosting