openvad is a lightweight voice activity detection library with a C++17 signal
processing core and a Python API. It is designed for low-latency segmentation of
PCM speech audio without model downloads or runtime services.
The detector uses short-time energy, zero-crossing rate, adaptive noise-floor tracking, hysteresis thresholds, and segment post-processing. This makes it fast, portable, and explainable. It is a practical baseline for streaming, ASR pre-processing, diarization pre-filters, and batch dataset cleanup.
- C++ core exposed through
pybind11. - Python 3.10+ API with typed dataclasses.
uv-friendly packaging and development workflow.- CLI for WAV files.
- No external model weights.
- Frame probabilities, energy traces, and final speech segments.
- Conservative defaults with tunable aggressiveness.
For local development:
uv sync --extra dev
uv run pytestBuild a wheel:
uv run python -m buildInstall editable during development:
uv pip install -e ".[dev]"uv run openvad input.wav
uv run openvad input.wav --json
uv run openvad input.wav --aggressiveness 2 --onset-threshold 0.62Output example:
0.240s - 1.830s ( 1.590s, confidence=0.812)
from openvad import VoiceActivityDetector, read_wav
samples, sample_rate = read_wav("speech.wav")
detector = VoiceActivityDetector(aggressiveness=1)
result = detector.analyze(samples, sample_rate)
for segment in result.segments:
print(segment.start, segment.end, segment.confidence)You can also pass a mono numpy.float32 array directly:
from openvad import detect
result = detect(samples, sample_rate=16_000, speech_pad_ms=60)VadConfig exposes the main detector controls:
| Parameter | Default | Meaning |
|---|---|---|
frame_ms |
20.0 |
Analysis window length. |
hop_ms |
10.0 |
Step between adjacent frames. |
onset_threshold |
0.58 |
Probability needed to enter speech. |
offset_threshold |
0.42 |
Probability needed to remain in speech. |
min_speech_ms |
80 |
Drop shorter speech islands. |
min_silence_ms |
120 |
Fill shorter silence gaps inside speech. |
speech_pad_ms |
40 |
Expand final speech regions on both sides. |
aggressiveness |
1 |
0 is permissive, 3 is strict. |
Recommended starting points:
- Clean microphone speech:
aggressiveness=1. - Noisy environment:
aggressiveness=2or raiseonset_threshold. - Avoid clipping words at boundaries: increase
speech_pad_msto60-100. - Very short commands: lower
min_speech_msto40.
This project intentionally starts with a high-performance statistical VAD rather than a neural model. It performs well when speech energy is measurably above the local noise floor. It is less suitable for music-heavy audio, overlapping speakers in loud rooms, or speech buried under non-stationary noise.
For production systems, evaluate on your target audio and tune thresholds with
held-out data. The exposed frame-level probability, energy_db, and zcr
arrays are meant to make this tuning straightforward.
native/vad_core.cpp: frame feature extraction, adaptive probability, and post-processing.src/openvad/api.py: public detector API and frame-to-segment conversion.src/openvad/io.py: small PCM WAV reader with mono downmixing.src/openvad/cli.py: command-line interface.tests/: synthetic regression tests.
Documentation is split by language:
uv sync --extra dev
uv run ruff check .
uv run pytestThe native extension is compiled by setuptools and pybind11; CMake is not
required.
Generate a synthetic labeled dataset:
uv run python tools/make_synthetic_dataset.py --output data/synth --count 20Evaluate a labeled manifest:
uv run python tools/evaluate_dataset.py data/synth/manifest.jsonlSweep thresholds:
uv run python tools/sweep_thresholds.py data/synth/manifest.jsonlInspect one file with an HTML report:
uv run python tools/inspect_file.py data/synth/sample_000.wav --output report.htmlMore details:
- English: docs/en/tools.md
- 中文:docs/zh/tools.md
MIT