Non-Verbal Sound Detection — Closed captions that capture every sound, not just words

VoxChron
Next Generation Accessibility Captioning

Automatic captions that go beyond words — every sound event captured.

Every sigh, laugh, door slam, and silence tells a story. VoxChron is the only AI platform that transcribes speech and captures every non-verbal sound in your audio — making it the only captioning tool that truly meets accessibility standards.

Trusted by podcasters, broadcasters, and accessibility teams

voxchron_output.json

[00:01:02] Speaker 1 (hesitant, low tone): I... I don't think we should go.

[00:01:04] [deep sigh]

[00:01:05] [silence — 2.3s]

[00:01:08] [door creaks open]

[00:01:09] [wind howling — background]

[00:01:11] [FOOTSTEPS — rapid, approaching]

[00:01:12] Speaker 2 (urgent, rising pitch): We have to. There's no other way.

[00:01:14] [nervous laughter]

[00:01:16] [keys jangling]

Real output from VoxChron — every sound captured, timestamped, and labeled.

What is VoxChron?

VoxChron is an AI-powered closed captioning platform that transcribes speech and detects non-verbal sounds — including sighs, laughter, door slams, silence, and 500+ other sound events. Unlike standard transcription tools that capture only spoken words, VoxChron annotates every audible event with precise timestamps and labels, producing captions that automatically meet FCC §79.1, ADA Title III, WCAG 2.1 AA, and Section 508 accessibility requirements. It supports pre-recorded files up to 10 GB, live streams via RTMP/HLS/WebRTC, and exports to SRT, VTT, JSON, TXT, and DOCX formats. Accuracy exceeds 98% across 500+ sound categories.

🔊 500+ non-verbal sounds detected
🎯 98% detection accuracy
🎬 26 context modes (content + genre)
🆓 5 min free — no credit card
[laughter][sigh][applause][coughing][door slam][footsteps][glass breaking][phone ringing][wind howling][crowd murmuring][typing][birds chirping][engine revving][thunder][baby crying][dog barking]

The Problem

Other transcription tools only capture half the story

Standard transcription gives you words. But audio is so much more than words. The laughter that breaks tension. The silence that speaks volumes. The door slamming that changes the mood. The trembling voice that reveals fear.

VoxChron captures it all. Our AI detects and annotates 500+ distinct non-verbal sounds, emotions, and environmental audio events — automatically.

Other tools

"I don't think we should go."

VoxChron
00:00:01,200 → 00:00:01,800  [sigh]
00:00:02,000 → 00:00:04,600  "I... I don't think we should go."
00:00:04,700 → 00:00:07,000  [silence — 2.3s]
00:00:07,100 → 00:00:07,400  [door creaks]
00:00:07,500 → 00:00:09,200  [wind — background]
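The VoxChron cues above use SRT-style timestamps (HH:MM:SS,mmm). For illustration, a formatter that produces them from event start times in seconds might look like the sketch below; it is not VoxChron's internal code.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 1.2 -> '00:00:01,200'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)   # whole hours
    m, rem = divmod(rem, 60_000)     # whole minutes
    s, ms = divmod(rem, 1000)        # whole seconds + leftover milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(1.2))   # 00:00:01,200
print(srt_timestamp(4.6))   # 00:00:04,600
```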

What We Detect

Six layers of non-verbal intelligence

Every audio file contains layers of meaning. VoxChron detects them all and gives you granular control over what to include in your output.

Speech

Word-level transcription with speaker diarization. Verbatim or clean modes.

"I... I don't think we should go."

Paralanguage

Hesitations, filler words, sighs, laughter, throat clearing, stammering — the sounds people make that aren't words.

[sigh], [laughter], [throat clearing], [um], [gasp]

Non-Word Sounds

Coughs, sneezes, yawns, whistles, claps, snaps — human-generated sounds that carry meaning but aren't speech.

[coughing], [sneezing], [clapping], [whistling]

Environmental Sounds

Traffic, wind, rain, crowd noise, silence, music — ambient sound captions that set the scene for every listener.

[traffic], [rain], [crowd murmuring], [silence — 3.2s]

Sound Effects & Foley

Door slams, footsteps, glass breaking, explosions, phone rings — sound effect subtitles with precise timestamps for every discrete event.

[door slam], [footsteps], [glass shattering], [gunshot]

Emotional Cues

AI-detected emotional tone per utterance — happy, sad, angry, fearful, surprised, neutral — from voice alone.

(angry, rising pitch), (whispered, fearful), (excited)
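To make the six layers concrete, here is a plausible sketch of a single detected event as it might appear in a JSON export, rendered into a bracketed caption label. The field names are illustrative assumptions, not VoxChron's documented schema.

```python
# Illustrative sketch of one detected sound event. Field names are
# assumptions for illustration, not the documented VoxChron schema.
event = {
    "start": 64.7,             # seconds from file start
    "end": 67.0,
    "layer": "environmental",  # speech | paralanguage | non_word_sounds |
                               # environmental | sound_effects | emotions
    "label": "silence",
    "detail": "2.3s",
    "confidence": 0.94,        # later used for render-time filtering
}

def to_caption(ev):
    """Render an event as a bracketed caption label, e.g. [silence — 2.3s]."""
    text = ev["label"] + (f" — {ev['detail']}" if ev["detail"] else "")
    return f"[{text}]"

print(to_caption(event))  # [silence — 2.3s]
```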

How It Works

From upload to compliant captions in minutes

  1.
    📁

    Upload your file

    Drag and drop any audio or video file up to 10 GB — MP3, WAV, MP4, MOV, M4A, FLAC, OGG, WEBM, and more. Or connect a live RTMP/HLS/WebRTC stream.

  2.
    🧠

    AI analyzes 6 signal layers

    VoxChron processes speech, paralanguage, non-word sounds, environmental sounds, sound effects, and emotional cues simultaneously — detecting 500+ distinct sound categories.

  3.

    Download compliant captions

    Export to SRT, VTT, JSON, TXT, or DOCX with every sound timestamped and labeled. Automatically meets FCC, ADA, WCAG 2.1, and Section 508 requirements.

Context Modes

Adapts to any content type

Tell VoxChron what you are captioning — even the specific movie genre — and it automatically adjusts which non-verbal sounds to prioritize, how to format them, and which compliance rules to apply. Horror gets silence and jump-scare detection. Comedy gets laughter and timing. Legal gets every breath and pause. No other captioning tool does this.

Content Types

📄General
🎙️Podcast
Accessibility
📰News
⚖️Legal / Interview
📡Live Broadcast
📱Vlog / Social Media
🎓Education
🏢Training Video
📢Marketing Video

Movie Genres — Detection adapts to the genre

💥Movie — Action
😂Movie — Comedy
👻Movie — Horror
🎭Movie — Drama
🔪Movie — Thriller
💕Movie — Romance
🚀Movie — Sci-Fi
🧙Movie — Fantasy
🎥Movie — Documentary
🎵Movie — Musical
Movie — Animation
🤠Movie — Western
🔫Movie — Crime
💣Movie — War
🗺️Movie — Adventure
🕵️Movie — Film Noir

Each mode tunes detection weights, sound priorities, and formatting rules automatically.

Compliance

Compliance-ready out of the box

Non-verbal sound annotations formatted to meet real-world regulatory standards. No manual cleanup needed.

FCC

Broadcast-grade closed captions with sound annotations

ADA

Accessible captions with non-verbal context for deaf and hard-of-hearing viewers

Legal

Verbatim transcripts with pauses, hesitations, and sounds

WCAG

Web-accessible media with full audio description

Accessibility First

The only captioning tool that truly meets accessibility standards

Every accessibility guideline — FCC, ADA, WCAG 2.1, Section 508 — requires that captions include non-verbal sounds. No other tool does this automatically. Until now.

Every other captioning tool

Non-compliant

[00:01:02]

"I don't think we should go."

[00:01:05]

— nothing captured —

[00:01:11]

"We have to go. There's no other way."

Missing [sigh] — emotional context lost
Missing [silence — 2.3s] — dramatic pause invisible
Missing [door creaks] — scene change undetected
Missing [footsteps] — approaching danger unknown

A deaf or hard-of-hearing viewer misses the entire emotional scene

VoxChron

Meets requirements

[00:01:02]

(hesitant) "I... I don't think we should go."

[00:01:04]

[deep sigh] [silence — 2.3s] [door creaks] [wind]

[00:01:11]

[FOOTSTEPS — rapid, approaching]

(urgent) "We have to. There's no other way."

Every sigh, pause, and sound — captured
Emotional tone annotated per speaker
Environmental sounds set the scene
A deaf or hard-of-hearing viewer experiences the full story

True accessibility — nothing is lost

Why this matters — what the standards actually require

📺

FCC §79.1

Captions must include non-speech information like sound effects and speaker identification

⚖️

ADA Title III

Effective communication requires captions that convey the full auditory experience

🌐

WCAG 2.1 AA

Success Criterion 1.2.2: Captions must include identification of non-speech sounds needed to understand content

🏛️

Section 508

Federal media must include captions with all significant sound effects and speaker changes

Every standard requires non-verbal sounds in captions — because true captions for deaf and hard-of-hearing viewers must convey the full auditory experience, not just words. VoxChron is the only tool that does this automatically.

📺

Broadcasters

🎬

Filmmakers

🎓

Universities

🏛️

Government

🏥

Healthcare

⚖️

Legal

Ready to meet accessibility standards without manual work?

Start free — 5 minutes included

Comparison

VoxChron vs other captioning tools

Most transcription tools capture words. Only VoxChron captures everything.

Feature comparison: VoxChron vs. Descript, Rev, and Otter.ai

  • Speech transcription
  • Non-verbal sound detection (500+)
  • 6 signal layers (paralanguage, environmental, emotional)
  • Speaker diarization
  • Emotional tone detection
  • Silence & pause annotation
  • Full Hearing Impaired Mode (Full CC)
  • Content-type adaptation (26 modes)
  • Adjustable detection sensitivity
  • FCC/ADA/WCAG/508 auto-compliance (partial elsewhere)
  • Live streaming transcription
  • Translation (75+ languages)
  • Dual subtitles (original + translated)
  • SRT/VTT/JSON/TXT/DOCX export (partial elsewhere)
  • REST API access
  • Pay-as-you-go (no subscription required)
  • GDPR compliant / EU data processing

Comparison based on publicly available features as of 2026. VoxChron is the only platform purpose-built for non-verbal sound detection in captions.

Pricing

Simple, transparent pricing

Pay only for what you use. No hidden fees. Start with 5 free minutes.

Service · Price · Turnaround

AI Transcription · $0.25/min · 1–5 minutes
Speech-to-text with word-level timestamps

AI Non-Verbal Sounds · $0.45/min · 1–5 minutes
All 6 signal layers — every sound captured

Full Hearing Impaired Mode · Included with Non-Verbal plan (no extra cost) · 1–5 minutes
Speaker labels, music notation, all background sounds — FCC/ADA compliant. 4 sensitivity presets from Key Moments to Full CC.

Translation Add-on (New) · +$1.50/min · 1–5 minutes
Translate transcripts to 75+ languages — optional non-verbal label translation included

Dual Subtitles Add-on (New) · +$0.50/min · 1–5 minutes
Original + translated subtitles in one file — top/bottom, side-by-side, or inline formats

Live Streaming (New)

Live Clean Verbatim · $0.50/min · Real-time (~300ms)
Real-time clean transcription — readable, no filler words

Live Verbatim · $0.80/min · Real-time (~300ms)
Real-time verbatim — includes filler words (um, uh), hesitations

Live Non-Verbal Detection · +$0.80/min · Real-time (~2s buffer)
Add [laughs], [sighs], [coughs], [pause] tags to your live stream
Free Trial
Free
5 min included
  • Transcription only
  • SRT & JSON export
  • 3 context modes
Get started
Pay As You Go
Usage-based
No monthly fee
  • All signal layers
  • Full Hearing Impaired Mode
  • All export formats
  • API access (10 req/min)
Get started
Most Popular
Pro
$49/mo
120 min included
  • Priority processing
  • Full Hearing Impaired Mode
  • API access (60 req/min)
Get started
Business
$149/mo
500 min included
  • Dedicated support
  • Full Hearing Impaired Mode
  • API access (300 req/min)
Get started

All paid plans include API access. USDC payments supported via Stripe.

Need more volume? Contact us for a custom plan.
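To make the pay-as-you-go math concrete, here is a small sketch using the per-minute rates listed above. It assumes the listed services bill as independent line items, which is our reading of the table rather than a documented rule.

```python
# Pay-as-you-go rates ($/min) from the pricing table above.
RATES = {
    "transcription": 0.25,   # AI Transcription
    "non_verbal": 0.45,      # AI Non-Verbal Sounds
    "translation": 1.50,     # Translation add-on
    "dual_subtitles": 0.50,  # Dual Subtitles add-on
}

def job_cost(minutes, services):
    """Total cost in dollars for a pre-recorded job, services billed independently."""
    return round(minutes * sum(RATES[s] for s in services), 2)

# A 60-minute podcast with non-verbal detection and translation:
print(job_cost(60, ["non_verbal", "translation"]))  # 117.0
```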

For Developers

Built for developers

Integrate non-verbal sound intelligence into your own apps. One API call to transcribe speech and detect every sound.

  • Simple REST API with structured JSON responses
  • API key authentication with rate limiting
  • All 6 signal layers configurable per request
  • Export to SRT, VTT, DOCX, or JSON
  • Webhook callbacks on job completion

# Transcribe with non-verbal sound detection
curl -X POST https://api.voxchron.com/v1/transcribe \
  -H "Authorization: Bearer vs_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
        "file_url": "https://...",
        "transcription_mode": "verbatim",
        "context_mode": "podcast",
        "signal_layers": {
          "paralanguage": true,
          "non_word_sounds": true,
          "environmental": true,
          "sound_effects": true,
          "emotions": true
        }
      }'

Live Streaming

Real-time transcription for live streams

Connect any RTMP, HLS, or WebRTC stream and get transcripts delivered in real-time with ~300ms latency. Optionally detect non-verbal sounds live.

6 languages — English, Spanish, French, German, Italian, Portuguese
2 modes — Clean Verbatim ($0.50/min) or Verbatim with filler words ($0.80/min)
Non-verbal add-on — detect [laughs], [sighs], [coughs], [pause] live (+$0.80/min)
Instant exports — download SRT, VTT, JSON, or TXT when the stream ends
Live
12:34 · $1.24
[0:05]
[applause] Thank you everyone for joining today's live event.
[0:12]
We have some incredible announcements to share with you.
[0:18]
[laughter] I know, I know — everyone's excited.
[0:24]
[pause — 2s] So let's get started with the first demo.
[0:31]
[crowd murmuring] Listening...

FAQ

Frequently asked questions

What exactly does VoxChron detect?
VoxChron detects 500+ non-verbal sounds across 6 signal layers: speech (with speaker diarization), paralanguage (sighs, laughter, hesitations), non-word sounds (coughs, claps), environmental sounds (traffic, rain, silence), sound effects (door slams, footsteps), and emotional cues (tone, pitch changes). Every sound is timestamped and categorized.
How is this different from regular transcription?
Regular transcription only captures words. VoxChron captures everything — the sigh before someone speaks, the 3-second silence that changes the meaning, the door slam in the background. This is critical for accessibility compliance, where captions must include non-verbal sounds by law (FCC, ADA, WCAG 2.1).
What file formats do you support?
We accept all major audio and video formats: MP3, WAV, MP4, MOV, M4A, FLAC, OGG, WEBM, and more. Maximum file size is 10 GB. For output, we export to SRT, VTT, JSON, TXT, and DOCX.
How does live streaming transcription work?
Connect any RTMP, HLS, or WebRTC stream URL and VoxChron transcribes it in real-time with ~300ms latency. You can optionally enable non-verbal detection on your live stream. When the stream ends, download the full transcript instantly in any format.
What languages do you support?
Pre-recorded transcription supports 99 languages. Translation is available into 75+ languages. Live streaming currently supports 6 languages: English, Spanish, French, German, Italian, and Portuguese.
Is VoxChron compliant with accessibility standards?
Yes — VoxChron is the only captioning tool that automatically meets FCC §79.1, ADA Title III, WCAG 2.1 AA, and Section 508 requirements. All these standards require captions to include non-verbal sounds, which VoxChron does by default.
How accurate is the non-verbal detection?
Our AI model is trained on 500+ sound categories and achieves an overall accuracy rate exceeding 98% — independently measured across common non-verbal sounds like laughter, coughing, applause, and environmental audio.
What is Full Hearing Impaired Mode?
Full Hearing Impaired Mode is a free upgrade included with every Non-Verbal Sounds job. When enabled, it produces a separate Full CC subtitle file that adds speaker labels (e.g. [MARIA]), music notation (♪ [upbeat music]), and all detected background sounds — including minor ones like [clock ticking] or [bird chirping in distance] — not just the key sounds. Standard mode captures sounds above 80% confidence. Full CC mode drops the threshold to 45% so nothing is missed. You choose from 4 sensitivity presets: Key Moments Only, Recommended, Enhanced Detection, or Full Hearing Impaired Mode.
Do you offer an API?
Yes. VoxChron provides a full REST API for automated transcription and non-verbal detection. All plans include API access with rate limits based on your tier. You can integrate VoxChron directly into your workflow, CMS, or media pipeline.
How does pricing work?
Pay only for what you use — no subscriptions required. AI Transcription is $0.25/min and Non-Verbal Detection is $0.45/min. We also offer Pro ($49/mo) and Business ($149/mo) plans with included minutes and lower per-minute rates. Start with 5 free minutes, no credit card needed.
Can I try it for free?
Absolutely. Every new account gets 5 free minutes of processing — no credit card required. Upload any audio or video file and see VoxChron in action. You can also redeem promotional coupon codes for additional free minutes.
How does VoxChron compare to Descript, Rev, or Otter.ai?
Descript, Rev, and Otter.ai are excellent transcription tools — but they only capture spoken words. VoxChron is the only platform that also detects 500+ non-verbal sounds (sighs, laughter, door slams, silence, emotional tone) across 6 signal layers. This makes VoxChron the only tool that automatically produces FCC/ADA/WCAG/Section 508-compliant captions without manual annotation. Other tools require you to add non-verbal sounds by hand — VoxChron does it automatically.
What is render-time threshold filtering?
After VoxChron processes your audio, all detected sounds are stored with their confidence scores. You can then switch between 4 detection sensitivity presets — Key Moments Only (90%), Recommended (80%), Enhanced (65%), or Full CC (45%) — without reprocessing the file. The output is filtered instantly at render time, so you can find the perfect balance between detail and readability in seconds.
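Conceptually, render-time filtering is a simple threshold over stored confidence scores. Here is a minimal sketch using the four preset thresholds from the answer above (illustrative, not VoxChron's implementation):

```python
# Render-time threshold filtering: events keep their confidence scores,
# so output can be re-filtered instantly without reprocessing the audio.
PRESETS = {
    "key_moments": 0.90,
    "recommended": 0.80,
    "enhanced": 0.65,
    "full_cc": 0.45,
}

def filter_events(events, preset):
    """Keep only sound events at or above the preset's confidence threshold."""
    threshold = PRESETS[preset]
    return [e for e in events if e["confidence"] >= threshold]

events = [
    {"label": "laughter", "confidence": 0.97},
    {"label": "clock ticking", "confidence": 0.52},
    {"label": "door slam", "confidence": 0.86},
]

print([e["label"] for e in filter_events(events, "recommended")])
# ['laughter', 'door slam']
print([e["label"] for e in filter_events(events, "full_cc")])
# ['laughter', 'clock ticking', 'door slam']
```

Switching presets only re-runs this filter over stored results, which is why no reprocessing is needed.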
What content types and genres does VoxChron support?
VoxChron offers 26 context modes: 10 content types (General, Podcast, Accessibility, News & Broadcast, Legal/Interview, Live Broadcast, Vlog/Social Media, Education, Training Video, Marketing Video) and 16 movie genres (Action, Comedy, Horror, Drama, Thriller, Romance, Sci-Fi, Fantasy, Documentary, Musical, Animation, Western, Crime, War, Adventure, Film Noir). Each mode automatically adjusts which sounds to prioritize, how to format annotations, and which compliance rules to apply.
How does VoxChron handle speaker diarization?
VoxChron automatically identifies and labels individual speakers throughout your audio using AI-powered speaker diarization. Each speaker is assigned a unique label (e.g. [Speaker 1], [Speaker 2]) and in Full Hearing Impaired Mode you can assign custom names (e.g. [MARIA], [JAMES]). Speaker changes are timestamped alongside non-verbal sounds, so your captions show exactly who said what and what sounds occurred between speakers.
Can VoxChron be used for podcast transcription?
Yes — VoxChron is ideal for podcast transcription because it captures far more than words. In Podcast mode, VoxChron detects laughter, cross-talk, sighs, long pauses, applause, and background sounds that give listeners the full context. Speaker diarization labels each host and guest automatically. Export to SRT, VTT, or TXT for show notes, accessibility, or SEO-optimized transcripts that search engines can index.
Is my data secure on VoxChron?
Yes. All uploads are encrypted in transit (TLS 1.3) and at rest (AES-256). Source files are automatically deleted from our servers after processing completes — only your subtitle exports are retained. VoxChron is fully GDPR-compliant with data processing agreements, a named Data Protection Officer, and you can export or delete all your personal data at any time from your account settings.
How fast does VoxChron process audio and video files?
VoxChron processes most files at 5–10x real-time speed. A 60-minute podcast typically completes in 6–12 minutes. Live streaming transcription runs in real-time with approximately 300ms latency. Processing speed depends on file length, selected services (transcription only vs. transcription + non-verbal detection), and current server load. All plans use the same processing infrastructure — there is no speed difference between free and paid tiers.

Trusted by creators & enterprises worldwide

8M+ minutes captioned
98% accuracy rate
4.9★ user rating

“VoxChron cut our captioning costs by 90% and turnaround from days to minutes. It's the only tool that captures everything — laughs, pauses, background sounds. Nothing else comes close.”

— Sarah Chen, Head of Content at StreamCore

“We needed FCC-compliant captions for 200+ hours of broadcast content. VoxChron detected non-verbal sounds we didn't even know were there. The Full CC mode is a game-changer for accessibility teams.”

— Marcus Taylor, Compliance Director at BroadcastOne

“As a podcast producer, I tried every transcription tool out there. VoxChron is the only one that catches the sighs, laughter, and awkward pauses that make conversations real. My deaf listeners finally get the full experience.”

— Elena Ruiz, Producer at The Daily Dialogue

Stop losing the sounds that matter

Upload your first audio file and see every non-verbal sound in your content — 5 minutes free, no credit card required.

Get started free

Share VoxChron

Your data stays private

Built for institutions, healthcare, legal, and enterprise teams with strict data requirements.

Read our Privacy Policy →
🔒

AES-256 Encryption

All uploads and processed files are encrypted in transit and at rest using AES-256 — the same standard used by banks and governments.

🚫

Never Used for Training

Your audio, video, and transcript data is never used to train AI models — not ours, not third parties. Your content stays yours.

GDPR Compliant

VoxChron is fully GDPR compliant. Data is processed within the EU and retained only as long as necessary for your job.