Voice System Guide

Advanced voice processing, transcription, and text-to-speech capabilities for Discord bots

1. Overview
2. Voice Processing Pipeline
3. Speech-to-Text (Whisper)
4. Text-to-Speech (TTS)
5. Voice Settings and Configuration
6. Multi-language Support
7. Usage Limits and Analytics
8. Troubleshooting

Overview

Friendify's Voice System provides comprehensive voice processing capabilities, enabling your Discord bots to understand spoken messages and respond with natural-sounding speech.

Key Features

Advanced Speech Recognition: Multiple Whisper models for accurate transcription
High-Quality TTS: Natural-sounding text-to-speech in multiple languages
Real-time Processing: Low-latency voice processing for responsive interactions
Noise Filtering: Automatic background noise reduction and audio enhancement
Multi-language Support: Turkish, English, and several other languages
Smart Audio Detection: Automatic detection of speech vs. background noise

Performance Optimized: Our voice system is optimized for Discord's voice infrastructure with minimal latency and high reliability.

Voice Processing Pipeline

Understanding how voice is processed helps you optimize your bot's voice interactions.

1. Audio Capture: Capture voice from Discord channels
2. Audio Processing: Noise reduction and enhancement
3. Transcription: Convert speech to text using Whisper
4. AI Response: Generate and speak AI response
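
The four stages can be pictured as a simple chain of functions. The sketch below is illustrative only: the class name and the stage callables are hypothetical stand-ins supplied by the caller, not Friendify APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the four stages; every callable is a hypothetical stand-in."""
    capture: Callable[[], bytes]          # 1. Audio Capture
    enhance: Callable[[bytes], bytes]     # 2. Audio Processing
    transcribe: Callable[[bytes], str]    # 3. Transcription
    respond: Callable[[str], str]         # 4. AI Response (text to be spoken)

    def run_once(self) -> str:
        raw = self.capture()
        clean = self.enhance(raw)
        text = self.transcribe(clean)
        return self.respond(text)


# Usage with dummy stages in place of real audio and models.
pipeline = VoicePipeline(
    capture=lambda: b"\x00\x01",
    enhance=lambda pcm: pcm,
    transcribe=lambda pcm: f"{len(pcm)} bytes heard",
    respond=lambda text: f"You said: {text}",
)
reply = pipeline.run_once()
```

Keeping each stage behind its own callable makes it easy to time or replace a single step when tuning latency.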

Audio Quality Enhancement

Noise Suppression: Remove background noise and echo
Volume Normalization: Automatic gain control for consistent levels
Frequency Filtering: Optimize audio for speech recognition
Silence Detection: Automatically detect speech boundaries
Format Conversion: Support for various audio formats and codecs
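
As an illustration of volume normalization, the sketch below scales 16-bit PCM samples so that the loudest sample reaches a target level. This is plain peak normalization, a simpler stand-in for automatic gain control; the function name and the -3 dBFS default are assumptions, not documented Friendify behavior.

```python
from array import array


def normalize_peak(samples: array, target_dbfs: float = -3.0) -> array:
    """Scale 16-bit PCM samples so the loudest one sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array("h", samples)  # pure silence: nothing to scale
    target_peak = 32767 * 10 ** (target_dbfs / 20)
    gain = target_peak / peak
    # Clamp to the int16 range to guard against rounding overflow.
    return array(
        "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
    )
```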

Audio Processing Settings:
- Sample Rate: 16kHz (optimized for speech)
- Bit Depth: 16-bit
- Channels: Mono (converted from stereo if needed)
- Format: WAV/PCM for processing
- Noise Gate: -40dB threshold
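
To make the mono conversion and the -40dB noise gate concrete, here is a minimal sketch using only the standard library. It assumes interleaved 16-bit stereo input and measures level as RMS relative to full scale; how Friendify actually measures level is not documented, and the function names are illustrative.

```python
import math
from array import array


def stereo_to_mono(interleaved: array) -> array:
    """Average interleaved L/R 16-bit samples into a single mono channel."""
    left, right = interleaved[0::2], interleaved[1::2]
    return array("h", ((l + r) // 2 for l, r in zip(left, right)))


def rms_dbfs(samples: array) -> float:
    """RMS level of 16-bit samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")


def passes_noise_gate(samples: array, threshold_db: float = -40.0) -> bool:
    """True when the frame is loud enough to be sent on for transcription."""
    return rms_dbfs(samples) >= threshold_db
```

A frame of constant amplitude 100 measures roughly -50 dBFS and is gated out, while a frame at half of full scale measures about -6 dBFS and passes.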

Speech-to-Text (Whisper)

Friendify uses OpenAI's Whisper models for speech transcription, with several model options to suit different use cases.

Whisper Tiny: Speed ~10x realtime, size 39MB. Fast, basic accuracy.
Whisper Base: Speed ~7x realtime, size 74MB. Balanced speed/quality.
Whisper Small: Speed ~4x realtime, size 244MB. High accuracy (recommended).
Online Whisper: Speed ~2x realtime, API-based. Highest accuracy.
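
The "x realtime" figures translate directly into a rough processing-time estimate: clip length divided by the speed factor. The sketch below uses the approximate factors quoted above; the dictionary keys are illustrative names, and real timings depend on hardware and, for the API-based option, on network latency.

```python
# Approximate speed factors quoted above ("~10x realtime" and so on).
REALTIME_FACTOR = {"tiny": 10.0, "base": 7.0, "small": 4.0, "online": 2.0}


def estimated_seconds(clip_seconds: float, model: str) -> float:
    """Rough transcription time: clip length divided by the speed factor."""
    return clip_seconds / REALTIME_FACTOR[model]
```

For a 30-second clip this gives about 3 seconds with Tiny, 7.5 seconds with Small, and 15 seconds with the online model.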

Transcription Features

Multi-language Detection: Automatic language identification
Punctuation and Capitalization: Proper text formatting
Speaker Recognition: Distinguish between different speakers
Timestamp Generation: Word-level timing information
Confidence Scoring: Quality assessment of transcriptions
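
How Friendify computes its confidence score is not documented here. As one possible approach, the open-source openai-whisper package returns per-segment `start`, `end`, and `avg_logprob` fields, and a common heuristic is to exponentiate the average log-probability. The sketch below applies that heuristic to a result dictionary of that shape; the function names are illustrative.

```python
import math


def segment_confidence(segment: dict) -> float:
    """Map a segment's average token log-probability to a 0..1 score."""
    return math.exp(segment["avg_logprob"])


def transcript_confidence(result: dict) -> float:
    """Duration-weighted mean confidence over all segments (0.0 if empty)."""
    total = weighted = 0.0
    for seg in result.get("segments", []):
        duration = max(seg["end"] - seg["start"], 0.0)
        total += duration
        weighted += duration * segment_confidence(seg)
    return weighted / total if total else 0.0
```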

Model Selection Strategy

Friendify automatically selects the best model based on:

Audio Quality: Higher quality audio uses more accurate models
Length: Shorter clips use faster models for responsiveness
Language: Some models perform better for specific languages
User Preferences: Manual model selection in settings
Fallback Chain: Automatic fallback if primary model fails

Smart Fallback: If the primary model fails or produces low-confidence results, the system automatically tries alternative models.
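
A fallback chain of this kind could look like the sketch below. It is a minimal illustration under stated assumptions: each transcriber is a caller-supplied function returning text and a confidence between 0 and 1, and the 0.6 threshold is an assumed value, not a documented Friendify setting.

```python
from typing import Callable, Iterable, Optional, Tuple

# A transcriber takes audio bytes and returns (text, confidence in 0..1).
Transcriber = Callable[[bytes], Tuple[str, float]]


def transcribe_with_fallback(
    audio: bytes,
    chain: Iterable[Tuple[str, Transcriber]],
    min_confidence: float = 0.6,  # assumed threshold, not a documented value
) -> Optional[Tuple[str, str, float]]:
    """Try each model in order. Return (model, text, confidence) for the
    first acceptable result, else the best low-confidence one, else None."""
    best = None
    for name, transcriber in chain:
        try:
            text, confidence = transcriber(audio)
        except Exception:
            continue  # model failed: move on to the next one
        if confidence >= min_confidence:
            return name, text, confidence
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
    return best
```

Returning the best low-confidence result instead of nothing lets the caller decide whether to ask the user to repeat themselves.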

Text-to-Speech (TTS)

Generate natural-sounding speech responses with various voice options and customization settings.

Neural Emma (EN): Female, American English. Clear pronunciation, natural intonation, professional tone.
Neural Ryan (EN): Male, American English. Deep, warm voice, confident delivery, versatile tone.
Neural Emel (TR): Female, Turkish. Native Turkish speaker, expressive delivery, cultural awareness.
Neural Ahmet (TR): Male, Turkish. Clear articulation, friendly tone, natural accent.

TTS Customization

Speech Rate: Adjust speaking speed (0.5x to 2.0x)
Pitch Control: Modify voice pitch (-50% to +50%)
Volume: Set output volume level
Emphasis: Add stress to specific words
Pauses: Insert custom pauses and breaks
Pronunciation: Custom pronunciation for specific terms

SSML Example:
<speak>
  <prosody rate="1.2" pitch="+10%">
    Merhaba! <break time="500ms"/>
    Bugün <emphasis level="strong">harika</emphasis> bir gün!
  </prosody>
</speak>
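
Markup like the example above can also be generated in code. The sketch below is a hypothetical helper, not part of Friendify: it clamps rate and pitch to the documented ranges (0.5x to 2.0x, -50% to +50%), mirrors the attribute syntax of the example, and escapes the text so characters such as `&` do not break the XML.

```python
from xml.sax.saxutils import escape


def build_ssml(text: str, rate: float = 1.0, pitch_percent: int = 0) -> str:
    """Wrap text in <speak>/<prosody>, clamping to the documented ranges."""
    rate = min(max(rate, 0.5), 2.0)                    # 0.5x to 2.0x
    pitch_percent = min(max(pitch_percent, -50), 50)   # -50% to +50%
    return (
        "<speak>"
        f'<prosody rate="{rate:g}" pitch="{pitch_percent:+d}%">'
        f"{escape(text)}"
        "</prosody></speak>"
    )
```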

Audio Output Options

Bitrate: 128kbps, 192kbps, or 320kbps quality
Format: MP3, WAV, or OGG output formats
Stereo/Mono: Channel configuration options
Normalization: Automatic volume leveling
Compression: Dynamic range compression for Discord

Voice Settings and Configuration

Customize your bot's voice behavior through comprehensive settings in the dashboard.

Auto Voice Detection (Enabled): Automatically join voice channels when users speak
Voice Response (Enabled): Respond with voice instead of text when in voice channels
Noise Gate Threshold: Minimum audio level to trigger transcription
Maximum Recording Length: Maximum duration for voice recordings
Preferred Whisper Model: Default transcription model for voice processing
TTS Voice: Default text-to-speech voice for responses

Performance Impact: Higher quality settings may increase processing time and resource usage. Test different configurations to find the optimal balance.
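
When testing different configurations, it can help to keep the settings in one validated object. The sketch below mirrors the dashboard settings as a dataclass; all field names are hypothetical, the 30-second recording limit is an assumed value, and the other defaults follow values shown elsewhere in this guide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceSettings:
    """Hypothetical mirror of the dashboard settings; names are illustrative."""
    auto_voice_detection: bool = True    # shown as Enabled in this guide
    voice_response: bool = True          # shown as Enabled in this guide
    noise_gate_db: float = -40.0         # threshold from the audio settings
    max_recording_seconds: int = 30      # assumed value, not documented
    whisper_model: str = "small"         # the recommended model
    tts_voice: str = "Neural Emma (EN)"

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty when valid)."""
        problems = []
        if self.noise_gate_db > 0:
            problems.append("noise_gate_db must be at or below 0 dB")
        if self.max_recording_seconds <= 0:
            problems.append("max_recording_seconds must be positive")
        if self.whisper_model not in {"tiny", "base", "small", "online"}:
            problems.append(f"unknown whisper_model: {self.whisper_model}")
        return problems
```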

Multi-language Support

Friendify's voice system supports multiple languages with automatic detection and appropriate processing.

Supported Languages

Primary Support

Turkish (TR)
English (EN)

Additional Languages

German (DE)
French (FR)
Spanish (ES)
Italian (IT)
Portuguese (PT)
Russian (RU)

Language Detection

Automatic Detection: Whisper automatically identifies the spoken language
Confidence Scoring: Each detection includes a confidence level
Manual Override: Force specific language processing if needed
Mixed Language: Handle conversations with multiple languages
Fallback Languages: Preferred language order for ambiguous audio
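
The interplay between detection confidence, manual override, and fallback order can be sketched as a small decision function. The function name, the language codes, and the 0.5 cut-off are assumptions made for illustration.

```python
from typing import Optional, Sequence


def resolve_language(
    detected: str,
    confidence: float,
    fallback_order: Sequence[str] = ("tr", "en"),
    override: Optional[str] = None,
    min_confidence: float = 0.5,  # assumed cut-off, not a documented value
) -> str:
    """Pick the language code to process audio with."""
    if override:                      # Manual Override wins outright
        return override
    if confidence >= min_confidence:  # trust a confident detection
        return detected
    # Ambiguous audio: use the first preferred language, if any.
    return fallback_order[0] if fallback_order else detected
```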

Language Adaptation: The bot can adapt its personality and response style based on the detected language and cultural context.

Cross-language Features

Translation: Automatic translation between supported languages
Code-switching: Handle mixed-language conversations
Cultural Adaptation: Adjust responses for cultural appropriateness
Accent Recognition: Better recognition of regional accents

Usage Limits and Analytics

Monitor and manage your voice processing usage with built-in analytics and configurable limits.

Usage Tracking

Daily Limits: Track daily voice processing minutes
Monthly Quotas: Monitor monthly usage across all bots
Per-Bot Limits: Set individual limits for each bot
User Quotas: Limit voice usage per Discord user
Quality Metrics: Track transcription accuracy and success rates

Default Limits:
- Free Tier: 100 minutes/day, 1000 minutes/month
- Premium: 500 minutes/day, 5000 minutes/month
- Enterprise: Custom limits available
- Per-user: 20 minutes/day per Discord user

Analytics Dashboard

Usage Graphs: Visual representation of voice processing over time
Success Rates: Track transcription success and failure rates
Language Statistics: Distribution of processed languages
Performance Metrics: Average processing times and latencies
Cost Tracking: Monitor API costs and resource usage

Quota Management: When limits are reached, voice processing is temporarily disabled. Users will receive text responses instead.
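
Using the default limits listed earlier, the quota check and the text fallback can be sketched as follows. The names are hypothetical, and the sketch assumes that a request counts against the daily, monthly, and per-user limits at the same time; Enterprise limits are custom and therefore omitted.

```python
from dataclasses import dataclass

# (daily, monthly) minutes from the default limits; Enterprise is custom.
TIER_LIMITS = {"free": (100, 1000), "premium": (500, 5000)}
PER_USER_DAILY = 20


@dataclass
class Usage:
    """Minutes already consumed before the current request."""
    today_minutes: float = 0.0
    month_minutes: float = 0.0
    user_today_minutes: float = 0.0


def voice_allowed(tier: str, usage: Usage, minutes: float) -> bool:
    """True if processing `minutes` more audio stays within every limit."""
    daily, monthly = TIER_LIMITS[tier]
    return (
        usage.today_minutes + minutes <= daily
        and usage.month_minutes + minutes <= monthly
        and usage.user_today_minutes + minutes <= PER_USER_DAILY
    )


def response_mode(tier: str, usage: Usage, minutes: float) -> str:
    """Fall back to text once any limit would be exceeded."""
    return "voice" if voice_allowed(tier, usage, minutes) else "text"
```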

Troubleshooting

Common voice system issues and their solutions.

Audio Quality Issues

Poor Transcription: Check microphone quality and background noise levels
Bot Not Responding to Voice: Verify voice detection settings and noise gate threshold
Choppy Audio Output: Check network connection and Discord voice quality settings
Wrong Language Detection: Manually set preferred language in bot settings

Performance Issues

Slow Transcription: Switch to a faster Whisper model (Tiny or Base)
High Latency: Check server location and network conditions
Processing Failures: Verify audio format compatibility and file size limits
Memory Issues: Restart the bot if memory usage grows while processing long audio files

Common Fixes

Restart Voice Connection: Have the bot leave and rejoin the voice channel
Check Permissions: Ensure the bot has "Connect" and "Speak" permissions
Update Audio Drivers: Update your audio drivers and ensure Discord has access to the audio device
Clear Audio Cache: Clear temporary audio files if experiencing issues
Check Quota: Verify you haven't exceeded daily/monthly voice processing limits
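
The permission check can be automated. Discord represents permissions as a bitfield in which Connect is bit 20, Speak is bit 21, and Administrator is bit 3. The sketch below reports which voice permissions are missing from such a bitfield; obtaining the bot's effective permissions for a channel is left to your Discord library, and the function name is illustrative.

```python
from typing import List

# Permission bit flags as defined by Discord's API.
ADMINISTRATOR = 1 << 3
CONNECT = 1 << 20
SPEAK = 1 << 21


def missing_voice_permissions(permissions: int) -> List[str]:
    """Names of the voice permissions absent from a permission bitfield."""
    if permissions & ADMINISTRATOR:
        return []  # administrators implicitly hold every permission
    required = {"Connect": CONNECT, "Speak": SPEAK}
    return [name for name, bit in required.items() if not permissions & bit]
```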

Debug Mode: Enable voice debug mode in bot settings to get detailed logs of voice processing steps.
