Voice

Healthcare Voice API

Medical-specialized speech recognition and synthesis. Proven accuracy in clinical environments with real-time streaming.

Why Medical-Specialized Voice?

Generic STT models struggle with medical terminology. Persly Voice is trained on healthcare-specific vocabulary, which improves recognition of drug names, clinical terms, and abbreviations.

Drug Names

Accurate recognition of medication names

Spoken: "Metformin 500mg twice daily"
Generic: "Met four min 500 milligrams..."
Persly: "Metformin 500mg twice daily"

Medical Terms

Complex terminology recognition

Spoken: "Acetaminophen overdose"
Generic: "A set a mino fen over dose..."
Persly: "Acetaminophen overdose"

Clinical Abbreviations

Abbreviation handling in context

Spoken: "Left ACL tear suspected"
Generic: "Left A C L tear..."
Persly: "Left ACL tear suspected"

Medical Vocabulary

100K+ medical terms, drug names, and disease names

Context Understanding

Accurately distinguishes homophones in medical context

Multilingual Medical

Cross-language medical term recognition

Comprehensive Voice Features

Speech-to-Text (STT)

Real-time Streaming

WebSocket-based live transcription

Speaker Diarization

Automatic doctor/patient conversation separation

Timestamps

Sentence-level time information

Confidence Scores

Per-result confidence scoring
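
The diarization and confidence features above can be combined on the client side. The sketch below merges consecutive results from the same speaker into turns and flags turns that contain a low-confidence result. The `Result` class is a hypothetical stand-in for whatever the SDK returns, and both the 0.0 to 1.0 confidence scale and the 0.85 review threshold are assumptions, not documented values.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Result:
    """Hypothetical stand-in for one transcription result."""
    speaker: str
    text: str
    confidence: float  # assumed to range from 0.0 to 1.0


@dataclass
class Turn:
    speaker: str
    text: str
    min_confidence: float


def group_turns(results: Iterable[Result]) -> List[Turn]:
    """Merge consecutive results from the same speaker into one turn."""
    turns: List[Turn] = []
    for r in results:
        if turns and turns[-1].speaker == r.speaker:
            last = turns[-1]
            last.text = f"{last.text} {r.text}"
            # Track the weakest result so one bad segment flags the turn.
            last.min_confidence = min(last.min_confidence, r.confidence)
        else:
            turns.append(Turn(r.speaker, r.text, r.confidence))
    return turns


def needs_review(turn: Turn, threshold: float = 0.85) -> bool:
    """Flag a turn for human review (the threshold is an illustrative choice)."""
    return turn.min_confidence < threshold
```

Tracking the minimum rather than the average confidence is a conservative choice: a single misheard drug name matters more in a clinical note than the overall quality of the turn.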

Text-to-Speech (TTS)

Natural Speech

Optimized tone for medical information delivery

Medical Pronunciation

Accurate drug and disease name pronunciation

Tone Control

Informational/warning/normal tone selection

Multiple Voices

Various speaker styles available

Technical Specifications

STT Specifications

Supported Languages: 70+ languages
Sample Rate: 16kHz / 48kHz
Encoding: PCM, WebM, MP3
Latency: < 200ms (streaming)
Max Audio Length: 4 hours (batch) / Unlimited (streaming)
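
When streaming raw PCM, the sample rate determines how many bytes each audio chunk carries. The sketch below works out the chunk size and slices a buffer accordingly. It assumes 16-bit mono PCM and 100 ms chunks; the specifications above list only the sample rates, so bit depth, channel count, and chunk duration are illustrative assumptions.

```python
from typing import Iterator


def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM chunk (defaults assume 16-bit mono)."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield fixed-size slices of a PCM buffer; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# 100 ms of 16 kHz, 16-bit mono audio is 1,600 samples, or 3,200 bytes.
size_16k = chunk_bytes(16_000, 100)
one_second_of_silence = bytes(16_000 * 2)
chunks = list(iter_chunks(one_second_of_silence, size_16k))
```

Smaller chunks reduce the delay before audio reaches the server at the cost of more messages per second, so the chunk duration is worth tuning against the latency target.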

TTS Specifications

Supported Languages: 70+ languages
Sample Rate: 22.05kHz / 48kHz
Output Format: PCM, MP3, OGG
First Byte Latency: < 150ms
Max Text Length: 10,000 characters / request
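
Because each TTS request is limited to 10,000 characters, longer documents such as discharge instructions need to be split before synthesis. The sketch below is one possible client-side approach that prefers sentence boundaries. The function name and the splitting rule are illustrative and not part of the Persly SDK.

```python
import re
from typing import List

MAX_CHARS = 10_000  # per-request limit from the TTS specifications


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence ends keeps each synthesized clip from stopping mid-phrase, which matters when the clips are played back to back.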

Code Example

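import asyncio
from persly import Voice

client = Voice(api_key="YOUR_API_KEY")

async def microphone_stream():
    """Placeholder audio source: replace with code that yields audio chunks from your capture device."""
    raise NotImplementedError("Connect an audio source here")
    yield b""  # unreachable; makes this function an async generator

async def transcribe_stream():
    # Stream audio to the API and print each result as it arrives.
    async for result in client.transcribe_stream(
        audio_stream=microphone_stream(),
        language="en",
        enable_medical_mode=True,
        speaker_diarization=True,
    ):
        print(f"[{result.speaker}] {result.text}")
        print(f"  Confidence: {result.confidence}")
        print(f"  Medical terms: {result.medical_terms}")

if __name__ == "__main__":
    asyncio.run(transcribe_stream())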

Medical Speech Recognition Accuracy

Accuracy and word error rate (WER) results on medical speech datasets. For accuracy, higher is better; for WER, lower is better.

Medical STT Accuracy Comparison (Drug Name Accuracy, %)

Persly Voice: 94.3
Google Medical: 78.2
AWS Transcribe Medical: 76.5
Whisper Large v3: 62.1

General Text WER: 4.2%

Word error rate on general medical conversations

Medical Term WER: 6.8%

Word error rate on medical terminology

Drug Name Accuracy: 94.3%

Correct recognition of medication names

Streaming Latency: <200ms

Real-time transcription delay

Real-time Performance

Metric                  Persly Voice    Competitors Avg
First Result Latency    180ms           350ms
Streaming Delay         < 200ms         400-800ms
TTS First Byte          120ms           300ms

* WER = Word Error Rate (lower is better)
* Benchmarked on 1,000 medical consultation recordings across 4 languages
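
For readers who want to reproduce a WER figure on their own recordings, the sketch below computes it the standard way: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference transcript. Lowercasing and whitespace tokenization are simplifying assumptions; the normalization used for the benchmark above is not stated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution plus two insertions against a 4-word reference: 3 / 4.
wer = word_error_rate(
    "Metformin 500mg twice daily",
    "Met four min 500mg twice daily",
)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is not a percentage of words that were wrong in the strict sense.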

Use Cases

Real-time Clinical Documentation

Live transcription of doctor-patient conversations with automatic speaker separation and EMR integration

Voice AI Agents

Hospital appointment bots, medication reminder calls, health consultation voice assistants

Medical Dictation

Voice-based prescription writing, medical report dictation, test result recording

Patient Education TTS

Medication guidance voice synthesis, pre-surgery instructions, multilingual patient information

FAQ

What's the difference between streaming and batch processing?

Streaming uses WebSocket for real-time results, ideal for live consultations. Batch processes entire audio files at once, better for transcribing recordings.
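
The choice between the two modes can be reduced to a small rule. The sketch below is a hypothetical helper (not an SDK function) that uses only the limits from the STT specifications: 4 hours for batch and no stated limit for streaming. Sending an over-length recording through streaming is one option assumed here; splitting the file into shorter batch jobs would be another.

```python
from typing import Optional

BATCH_MAX_SECONDS = 4 * 60 * 60  # 4-hour batch limit from the STT specifications


def choose_mode(is_live: bool, duration_seconds: Optional[float] = None) -> str:
    """Return "streaming" or "batch" for a given audio source."""
    if is_live:
        # Live consultations need results as the audio arrives.
        return "streaming"
    if duration_seconds is None:
        raise ValueError("duration_seconds is required for recorded audio")
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    # Assumption: recordings over the batch limit are streamed instead.
    return "streaming"
```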

How is the medical vocabulary updated?

Monthly updates include new drug names and medical terms. Enterprise plans support custom vocabulary additions.

Is it HIPAA/GDPR compliant?

Yes. All voice data is encrypted in transit and deleted immediately after processing. We comply with HIPAA, GDPR, and local privacy regulations.

How does it integrate with existing RAG APIs?

Voice API works as an input/output layer for the Embed → Finder → Rerank → LLM pipeline. Use a single API key for all services.

Ready to Build with Persly?

Let's discuss how our APIs can power your healthcare product