Skip to content

TUSHARTAMRAKAR/VoiceScript

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VoiceScript



A production-grade, full-stack AI speech recognition system that transcribes any audio in any language — powered by a multi-stage AI pipeline combining vocal isolation, noise reduction, and transformer-based transcription.


🚀 Try Live Demo   ·   ⚡ Quick Start   ·   📐 Architecture   ·   ✨ Features



📸 Interface

VoiceScript UI


🧠 The Problem VoiceScript Solves

Most speech recognition tools fail in the real world because they assume clean, perfect audio. They struggle with:

  • 🎵 Background music underneath speech (YouTube videos, podcasts, interviews)
  • 🌍 Non-English languages — or mixed language audio
  • 🔊 Crowd noise and ambient sound degrading accuracy
  • ⏱️ Long audio files hitting API rate limits and time restrictions

VoiceScript solves all of this through a three-stage AI pipeline that processes audio before a single word is transcribed.


✨ Features

🎯 Core

  • 🎤 Live microphone recording with real-time timer
  • 📁 File upload — WAV, MP3, FLAC, OGG, WebM
  • 🔄 Any Language → English in one click
  • 🌐 Auto language detection badge
  • ♾️ No audio length limit — auto-chunked

🚀 Advanced

  • ⏱️ Timestamped transcript — every sentence tagged
  • 🌍 Translate to 55+ languages via Google Translate
  • 📄 Export to TXT — clean formatted file
  • 🎬 Export to SRT — YouTube-ready subtitles
  • 📑 Export to PDF — professional document

🔒 Privacy & Performance

  • 🏠 100% local processing — audio never leaves your machine
  • Whisper medium — 769M params, optimized for CPU
  • 🎵 Demucs vocal isolation — strips background music
  • 🔧 Auto cleanup — temp files deleted after transcription

🎨 UI/UX

  • 🌙 Premium dark interface — Space Grotesk + Inter
  • Smooth animations — fade, slide, pulse, heartbeat
  • 📱 Fully responsive — works on any screen size
  • 🎭 Animated background grid + floating orbs
  • 🏷️ 7 color-coded tech badges in footer

🛠 Tech Stack

Backend

Technology Version Purpose
Python 3.10+ Core backend language
Flask 3.0.3 REST API server + frontend serving
Flask-CORS 4.0.1 Cross-origin request handling
OpenAI Whisper medium Primary speech-to-text AI (769M params)
Facebook Demucs 4.0.1 Neural source separation — vocal isolation
pydub 0.25.1 Audio format conversion + preprocessing
deep-translator 1.11.4 Google Translate integration (55+ languages)
SpeechRecognition 3.11.0 Fallback recognition engine
NumPy 2.4.4 Numerical processing for Whisper
ffmpeg 8.1 Low-level audio codec processing

Frontend

Technology Purpose
HTML5 Semantic structure
CSS3 Dark theme, keyframe animations, responsive grid
Vanilla JavaScript ES6+ MediaRecorder API, Fetch API, DOM manipulation
Web Audio API Live microphone capture
Google Fonts Space Grotesk + Inter typography

Infrastructure

Technology Purpose
Docker Containerized deployment
Hugging Face Spaces Production hosting (16GB RAM, free tier)
GitHub Version control + CI

📐 Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         FRONTEND                                │
│              HTML5  ·  CSS3  ·  Vanilla JS ES6+                │
│                                                                 │
│  ┌─────────────────────┐    ┌──────────────────────────────┐   │
│  │   File Upload        │    │   Live Mic Recording         │   │
│  │   Drag & Drop        │    │   MediaRecorder API          │   │
│  │   MP3/WAV/FLAC/OGG  │    │   webm/ogg format            │   │
│  └──────────┬──────────┘    └──────────────┬───────────────┘   │
│             └──────────────┬───────────────┘                   │
│                            │ HTTP POST /transcribe              │
│                            │ multipart/form-data + mode         │
└────────────────────────────┼────────────────────────────────────┘
                             │
┌────────────────────────────┼────────────────────────────────────┐
│                       FLASK API                                 │
│                            ▼                                    │
│                  ┌──────────────────┐                          │
│                  │   app.py         │                          │
│                  │  POST /transcribe│                          │
│                  │  POST /translate │                          │
│                  │  GET  /languages │                          │
│                  │  GET  /health    │                          │
│                  └────────┬─────────┘                          │
│                           │                                     │
│              ┌────────────▼──────────────────────┐             │
│              │      BEAST MODE PIPELINE           │             │
│              │                                    │             │
│              │  STAGE 1 — pydub                   │             │
│              │  ├─ Any format → 16kHz mono WAV    │             │
│              │  ├─ Volume normalization            │             │
│              │  └─ Silence stripping              │             │
│              │              ↓                     │             │
│              │  STAGE 2 — Facebook Demucs         │             │
│              │  ├─ htdemucs model                 │             │
│              │  ├─ Neural source separation       │             │
│              │  ├─ Isolate vocal stem             │             │
│              │  └─ Discard music/noise/drums      │             │
│              │              ↓                     │             │
│              │  STAGE 3 — OpenAI Whisper Medium   │             │
│              │  ├─ 769M parameter transformer     │             │
│              │  ├─ beam_size=5  best_of=5         │             │
│              │  ├─ temperature=0  patience=2      │             │
│              │  ├─ condition_on_previous_text=True│             │
│              │  ├─ no_speech_threshold=0.25       │             │
│              │  ├─ word_timestamps=True           │             │
│              │  └─ Auto language detection        │             │
│              └────────────────────────────────────┘             │
│                           │                                     │
│           JSON response: { transcript, segments,                │
│                           detected_language, duration,          │
│                           word_count, engine }                  │
└───────────────────────────┼─────────────────────────────────────┘
                            │
                            ▼
              ┌─────────────────────────┐
              │     FRONTEND RENDERS    │
              │  · Transcript text      │
              │  · Language badge       │
              │  · Timestamps view      │
              │  · Translation panel    │
              │  · Export buttons       │
              └─────────────────────────┘

📁 Project Structure

VoiceScript/
│
├── backend/
│   ├── app.py              # Flask server — all API routes
│   ├── transcriber.py      # Beast mode pipeline: pydub → Demucs → Whisper
│   └── translator.py       # Google Translate integration (55+ languages)
│
├── frontend/
│   ├── index.html          # App structure — all UI elements
│   ├── style.css           # Dark theme, keyframe animations, responsive layout
│   └── app.js              # MediaRecorder, Fetch API, all interactivity
│
├── uploads/                # Temp audio storage (auto-deleted after processing)
│   └── .gitkeep
│
├── docs/
│   └── Screenshot.png      # UI screenshot
│
├── Dockerfile              # Docker container config for HF Spaces
├── .gitignore              # Excludes venv, uploads, cache
├── requirements.txt        # All Python dependencies
└── README.md               # This file

⚡ Quick Start

Prerequisites

1 — Clone

git clone https://github.com/TUSHARTAMRAKAR/VoiceScript.git
cd VoiceScript

2 — Virtual environment

python -m venv venv

# Windows
venv\Scripts\activate

# Mac/Linux
source venv/bin/activate

3 — Install dependencies

pip install -r requirements.txt

⚠️ First run downloads:

  • Whisper medium model — ~769MB (cached at ~/.cache/whisper/)
  • Demucs htdemucs model — ~300MB (cached automatically)

Both are one-time downloads. Every subsequent run loads instantly.

4 — Start the server

cd backend
python app.py

5 — Open the app

Open http://localhost:7860 in your browser. VoiceScript is running.


🚀 Live Demo

Deployed on Hugging Face Spaces · Docker · CPU Basic · Always On


🧪 How to Use

Upload a file:

  1. Drop any audio file onto the upload zone
  2. Choose mode — Transcribe or Any Language → English
  3. Click Transcribe File
  4. Get transcript + language badge + timestamps + export options

Record live:

  1. Click Start Recording — allow mic access
  2. Speak clearly
  3. Click Stop & Transcribe
  4. Transcript appears in seconds

Translate:

  • After transcription, the Translate panel appears
  • Pick any of 55+ languages from the dropdown
  • Click Translate — Google Translate does the rest

Export:

  • TXT — plain text file with metadata header
  • SRT — subtitle file with timestamps, ready for YouTube/VLC
  • PDF — opens a clean print-ready page, save as PDF

⚙️ Configuration

Change Whisper model in backend/transcriber.py:

# tiny | base | small | medium | large-v3
# medium = best CPU balance (default)
# large-v3 = maximum accuracy (needs GPU for practical speed)
WHISPER_MODEL_SIZE = "medium"

Change transcription language in backend/transcriber.py:

language = "en"   # en, hi, de, fr, es, ja, ko, zh...
# Remove language= entirely for auto-detection

🔮 Roadmap

  • GPU acceleration (CUDA) for large-v3 model
  • Real-time streaming transcription
  • Speaker diarization (who said what)
  • AI summarization of long transcripts
  • Transcript history (localStorage)
  • Docker compose for one-command setup
  • REST API documentation (Swagger/OpenAPI)
  • Chrome extension for transcribing browser audio

🤝 Contributing

# Fork → clone → create branch
git checkout -b feature/your-feature

# Make changes, then
git commit -m "feat: describe your change"
git push origin feature/your-feature

# Open a Pull Request

Please follow Conventional Commits.


👨‍💻 Author

Tushar Tamrakar

Full-Stack Developer · AI/ML Enthusiast · Builder

GitHub Email HuggingFace

"Built with curiosity, powered by caffeine, debugged at 2AM."


📄 License

MIT License — Copyright (c) 2026 Tushar Tamrakar

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software to deal in the Software without restriction,
including the rights to use, copy, modify, merge, publish, distribute,
sublicense, and/or sell copies of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.

star

Python · Flask · JavaScript · OpenAI Whisper · Facebook Demucs · pydub · ffmpeg · Docker · Hugging Face

About

🎙️ A full-stack AI-powered Speech Recognition web app. Upload audio or record live — transcribed instantly using OpenAI Whisper + Facebook Demucs vocal isolation. Built with Python, Flask & JavaScript.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors