Voice System Guide

Advanced voice processing, transcription, and text-to-speech capabilities for Discord bots

Table of Contents

  1. Overview
  2. Voice Processing Pipeline
  3. Speech-to-Text (Whisper)
  4. Text-to-Speech (TTS)
  5. Voice Settings and Configuration
  6. Multi-language Support
  7. Usage Limits and Analytics
  8. Troubleshooting

Overview

Friendify's Voice System provides comprehensive voice processing capabilities, enabling your Discord bots to understand spoken messages and respond with natural-sounding speech.

Key Features

  • Advanced Speech Recognition: Multiple Whisper models for accurate transcription
  • High-Quality TTS: Natural-sounding text-to-speech in multiple languages
  • Real-time Processing: Low-latency voice processing for responsive interactions
  • Noise Filtering: Automatic background noise reduction and audio enhancement
  • Multi-language Support: Primary support for Turkish and English, plus German, French, Spanish, Italian, Portuguese, and Russian
  • Smart Audio Detection: Automatic detection of speech vs. background noise
Performance Optimized: Our voice system is optimized for Discord's voice infrastructure with minimal latency and high reliability.

Voice Processing Pipeline

Understanding how voice is processed helps you optimize your bot's voice interactions.

  1. Audio Capture: Capture voice from Discord channels
  2. Audio Processing: Noise reduction and enhancement
  3. Transcription: Convert speech to text using Whisper
  4. AI Response: Generate and speak AI response

Audio Quality Enhancement

  • Noise Suppression: Remove background noise and echo
  • Volume Normalization: Automatic gain control for consistent levels
  • Frequency Filtering: Optimize audio for speech recognition
  • Silence Detection: Automatically detect speech boundaries
  • Format Conversion: Support for various audio formats and codecs
Audio Processing Settings:

  • Sample Rate: 16 kHz (optimized for speech)
  • Bit Depth: 16-bit
  • Channels: Mono (converted from stereo if needed)
  • Format: WAV/PCM for processing
  • Noise Gate: -40 dB threshold
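To make the noise gate setting concrete, here is a minimal sketch of how a -40 dB gate could be applied to 16-bit PCM samples. It assumes the threshold is measured as RMS level relative to full scale (dBFS), which the guide does not state; the function names are illustrative and not part of Friendify's API.

```python
import math
from array import array

NOISE_GATE_DBFS = -40.0   # threshold taken from the settings above
FULL_SCALE = 32768.0      # magnitude of the largest 16-bit sample


def stereo_to_mono(interleaved: array) -> array:
    """Average interleaved left/right 16-bit samples into one channel."""
    mono = array("h")
    for i in range(0, len(interleaved) - 1, 2):
        mono.append((interleaved[i] + interleaved[i + 1]) // 2)
    return mono


def rms_dbfs(samples: array) -> float:
    """Return the RMS level of 16-bit samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20.0 * math.log10(math.sqrt(mean_square) / FULL_SCALE)


def passes_noise_gate(samples: array,
                      threshold_dbfs: float = NOISE_GATE_DBFS) -> bool:
    """True when the frame is loud enough to be sent on for transcription."""
    return rms_dbfs(samples) >= threshold_dbfs
```

A frame that fails the gate would simply be skipped, so quiet background hum never reaches the transcription step.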

Speech-to-Text (Whisper)

Friendify uses OpenAI's Whisper models for speech transcription, with several model options for different use cases.

Whisper Tiny
Speed: ~10x realtime
Size: 39M parameters

Fast, basic accuracy

Whisper Base
Speed: ~7x realtime
Size: 74M parameters

Balanced speed/quality

Whisper Small
Speed: ~4x realtime
Size: 244M parameters

High accuracy (recommended)

Online Whisper
Speed: ~2x realtime
API-based

Highest accuracy
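The speed figures above can be turned into a rough latency estimate. The sketch below assumes "Nx realtime" means N seconds of audio are processed per second of compute, and it ignores network latency for the online model; the dictionary keys are illustrative names, not documented identifiers.

```python
# Approximate speed factors from the model list above (multiples of realtime).
SPEED_FACTORS = {"tiny": 10.0, "base": 7.0, "small": 4.0, "online": 2.0}


def estimated_seconds(clip_seconds: float, model: str) -> float:
    """Estimate transcription time for a clip of the given length."""
    return clip_seconds / SPEED_FACTORS[model]
```

Under these assumptions, a 60-second clip would take about 6 seconds with Tiny and about 15 seconds with Small.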

Transcription Features

  • Multi-language Detection: Automatic language identification
  • Punctuation and Capitalization: Proper text formatting
  • Speaker Recognition: Distinguish between different speakers
  • Timestamp Generation: Word-level timing information
  • Confidence Scoring: Quality assessment of transcriptions

Model Selection Strategy

Friendify automatically selects the best model based on:

  • Audio Quality: Higher quality audio uses more accurate models
  • Length: Shorter clips use faster models for responsiveness
  • Language: Some models perform better for specific languages
  • User Preferences: Manual model selection in settings
  • Fallback Chain: Automatic fallback if primary model fails
Smart Fallback: If the primary model fails or produces low-confidence results, the system automatically tries alternative models.
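The fallback behavior described above can be sketched as a loop over an ordered chain of models. This is only an illustration under assumptions: the `Transcript` type, the `Transcriber` callable signature, and the 0.6 confidence cutoff are all hypothetical, since the guide does not document how Friendify implements the chain.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale
    model: str


# Hypothetical type: a transcriber takes raw audio bytes and returns a Transcript.
Transcriber = Callable[[bytes], Transcript]


def transcribe_with_fallback(audio: bytes,
                             chain: Sequence[Transcriber],
                             min_confidence: float = 0.6) -> Optional[Transcript]:
    """Try each model in order; return the first confident result.

    If no model reaches min_confidence, return the best result seen,
    or None when every model failed.
    """
    best: Optional[Transcript] = None
    for transcriber in chain:
        try:
            result = transcriber(audio)
        except Exception:
            continue  # this model failed, move on to the next one
        if result.confidence >= min_confidence:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    return best
```

Ordering the chain from fastest to most accurate keeps short, clean clips responsive while still giving difficult audio a second chance.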

Text-to-Speech (TTS)

Generate natural-sounding speech responses with various voice options and customization settings.

Neural Emma (EN)

Female, American English

  • Clear pronunciation
  • Natural intonation
  • Professional tone

Neural Ryan (EN)

Male, American English

  • Deep, warm voice
  • Confident delivery
  • Versatile tone

Neural Emel (TR)

Female, Turkish

  • Native Turkish speaker
  • Expressive delivery
  • Cultural awareness

Neural Ahmet (TR)

Male, Turkish

  • Clear articulation
  • Friendly tone
  • Natural accent

TTS Customization

  • Speech Rate: Adjust speaking speed (0.5x to 2.0x)
  • Pitch Control: Modify voice pitch (-50% to +50%)
  • Volume: Set output volume level
  • Emphasis: Add stress to specific words
  • Pauses: Insert custom pauses and breaks
  • Pronunciation: Custom pronunciation for specific terms
SSML Example:

    <speak>
      <prosody rate="1.2" pitch="+10%">
        Merhaba! <break time="500ms"/>
        Bugün <emphasis level="strong">harika</emphasis> bir gün!
      </prosody>
    </speak>
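When SSML is generated from code rather than written by hand, it helps to clamp values to the ranges listed above and to let an XML library handle escaping. The following is a minimal sketch using Python's standard library; `build_ssml` is a hypothetical helper, not a Friendify function, and it only covers the `prosody` element.

```python
import xml.etree.ElementTree as ET


def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def build_ssml(text: str, rate: float = 1.0, pitch_percent: int = 0) -> str:
    """Wrap text in <speak><prosody>, keeping rate and pitch in range."""
    rate = clamp(rate, 0.5, 2.0)                         # 0.5x to 2.0x
    pitch_percent = int(clamp(pitch_percent, -50, 50))   # -50% to +50%
    speak = ET.Element("speak")
    prosody = ET.SubElement(
        speak, "prosody",
        rate=f"{rate:g}",
        pitch=f"{pitch_percent:+d}%",
    )
    prosody.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")
```

Building the tree with `ElementTree` means user-supplied text containing `<` or `&` cannot break the markup.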

Audio Output Options

  • Bitrate: 128kbps, 192kbps, or 320kbps quality
  • Format: MP3, WAV, or OGG output formats
  • Stereo/Mono: Channel configuration options
  • Normalization: Automatic volume leveling
  • Compression: Dynamic range compression for Discord
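The bitrate choice directly determines how much data each response produces. As a rough guide, and assuming constant-bitrate encoding (so this applies to compressed output such as MP3 or OGG, not uncompressed WAV), the size can be estimated as follows. The function name is illustrative.

```python
SUPPORTED_BITRATES_KBPS = (128, 192, 320)  # options listed above


def estimated_bytes(duration_seconds: float, bitrate_kbps: int) -> int:
    """Estimate encoded size: bitrate (bits/s) times duration, divided by 8."""
    if bitrate_kbps not in SUPPORTED_BITRATES_KBPS:
        raise ValueError(f"unsupported bitrate: {bitrate_kbps} kbps")
    return int(duration_seconds * bitrate_kbps * 1000 / 8)
```

For example, a 10-second reply at 128 kbps comes to about 160,000 bytes, while the same reply at 320 kbps is about 400,000 bytes.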

Voice Settings and Configuration

Customize your bot's voice behavior through comprehensive settings in the dashboard.

  • Auto Voice Detection: Automatically join voice channels when users speak (Enabled)
  • Voice Response: Respond with voice instead of text when in voice channels (Enabled)
  • Noise Gate Threshold: Minimum audio level to trigger transcription
  • Maximum Recording Length: Maximum duration for voice recordings
  • Preferred Whisper Model: Default transcription model for voice processing
  • TTS Voice: Default text-to-speech voice for responses
Performance Impact: Higher quality settings may increase processing time and resource usage. Test different configurations to find the optimal balance.
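If you mirror these dashboard settings in your own tooling, a small validated structure can catch mistakes early. The sketch below is hypothetical: the field names, the model and voice identifiers, and the 60-second recording default are assumptions for illustration, since the guide lists the settings but not their keys or default values (the -40 dB gate and the recommended Small model do come from earlier sections).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical identifiers for the models and voices described in this guide.
WHISPER_MODELS = ("tiny", "base", "small", "online")
TTS_VOICES = ("emma", "ryan", "emel", "ahmet")


@dataclass
class VoiceSettings:
    auto_voice_detection: bool = True
    voice_response: bool = True
    noise_gate_db: float = -40.0
    max_recording_seconds: int = 60   # placeholder, not a documented default
    whisper_model: str = "small"
    tts_voice: str = "emma"           # placeholder, not a documented default

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems: List[str] = []
        if self.noise_gate_db > 0:
            problems.append("noise_gate_db should be 0 or negative")
        if self.max_recording_seconds <= 0:
            problems.append("max_recording_seconds must be positive")
        if self.whisper_model not in WHISPER_MODELS:
            problems.append(f"unknown whisper_model: {self.whisper_model}")
        if self.tts_voice not in TTS_VOICES:
            problems.append(f"unknown tts_voice: {self.tts_voice}")
        return problems
```

Returning a list of problems rather than raising on the first one lets a settings form report every issue at once.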

Multi-language Support

Friendify's voice system supports multiple languages with automatic detection and appropriate processing.

Supported Languages

Primary Support

  • Turkish (TR)
  • English (EN)

Additional Languages

  • German (DE)
  • French (FR)
  • Spanish (ES)
  • Italian (IT)
  • Portuguese (PT)
  • Russian (RU)

Language Detection

  • Automatic Detection: Whisper automatically identifies the spoken language
  • Confidence Scoring: Each detection includes a confidence level
  • Manual Override: Force specific language processing if needed
  • Mixed Language: Handle conversations with multiple languages
  • Fallback Languages: Preferred language order for ambiguous audio
Language Adaptation: The bot can adapt its personality and response style based on the detected language and cultural context.
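The interaction between automatic detection, manual override, and fallback languages can be sketched as a single resolution function. The precedence order shown here (override first, then confident detection, then fallbacks) and the 0.5 confidence cutoff are assumptions; the guide does not specify them.

```python
from typing import Optional, Sequence

# Language codes from the Supported Languages list above.
SUPPORTED = ("tr", "en", "de", "fr", "es", "it", "pt", "ru")


def resolve_language(detected: str,
                     confidence: float,
                     override: Optional[str] = None,
                     fallbacks: Sequence[str] = ("tr", "en"),
                     min_confidence: float = 0.5) -> str:
    """Pick the language to process audio in."""
    if override is not None:
        if override not in SUPPORTED:
            raise ValueError(f"unsupported override language: {override}")
        return override  # a manual override always wins
    if detected in SUPPORTED and confidence >= min_confidence:
        return detected
    for lang in fallbacks:  # ambiguous audio: use the preferred order
        if lang in SUPPORTED:
            return lang
    raise ValueError("no supported fallback language configured")
```

With these defaults, a confident German detection is kept, while a low-confidence or unsupported detection falls back to Turkish.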

Cross-language Features

  • Translation: Automatic translation between supported languages
  • Code-switching: Handle mixed-language conversations
  • Cultural Adaptation: Adjust responses for cultural appropriateness
  • Accent Recognition: Better recognition of regional accents

Usage Limits and Analytics

Monitor and manage your voice processing usage with built-in analytics and configurable limits.

Usage Tracking

  • Daily Limits: Track daily voice processing minutes
  • Monthly Quotas: Monitor monthly usage across all bots
  • Per-Bot Limits: Set individual limits for each bot
  • User Quotas: Limit voice usage per Discord user
  • Quality Metrics: Track transcription accuracy and success rates
Default Limits:

  • Free Tier: 100 minutes/day, 1000 minutes/month
  • Premium: 500 minutes/day, 5000 minutes/month
  • Enterprise: Custom limits available
  • Per-user: 20 minutes/day per Discord user

Analytics Dashboard

  • Usage Graphs: Visual representation of voice processing over time
  • Success Rates: Track transcription success and failure rates
  • Language Statistics: Distribution of processed languages
  • Performance Metrics: Average processing times and latencies
  • Cost Tracking: Monitor API costs and resource usage
Quota Management: When limits are reached, voice processing is temporarily disabled. Users will receive text responses instead.
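The quota behavior can be illustrated with a small check that decides between a voice and a text response. The limits are the defaults listed under Usage Tracking; the function and type names are hypothetical, Enterprise is omitted because its limits are custom, and the exact point at which Friendify cuts over is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    daily_minutes: float
    monthly_minutes: float


# Default limits from this guide; Enterprise limits are custom and not modeled.
LIMITS = {
    "free": TierLimits(daily_minutes=100, monthly_minutes=1000),
    "premium": TierLimits(daily_minutes=500, monthly_minutes=5000),
}
PER_USER_DAILY_MINUTES = 20


def response_mode(tier: str,
                  used_today: float,
                  used_this_month: float,
                  user_used_today: float) -> str:
    """Return "voice" while under every limit, otherwise "text"."""
    limits = LIMITS[tier]
    if (used_today >= limits.daily_minutes
            or used_this_month >= limits.monthly_minutes
            or user_used_today >= PER_USER_DAILY_MINUTES):
        return "text"
    return "voice"
```

Note that the per-user limit can trigger the text fallback for one heavy user even when the bot as a whole is well under its daily quota.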

Troubleshooting

Common voice system issues and their solutions.

Audio Quality Issues

  • Poor Transcription: Check microphone quality and background noise levels
  • Bot Not Responding to Voice: Verify voice detection settings and noise gate threshold
  • Choppy Audio Output: Check network connection and Discord voice quality settings
  • Wrong Language Detection: Manually set preferred language in bot settings

Performance Issues

  • Slow Transcription: Switch to a faster Whisper model (Tiny or Base)
  • High Latency: Check server location and network conditions
  • Processing Failures: Verify audio format compatibility and file size limits
  • Memory Issues: Restart the bot if memory problems occur while processing long audio files

Common Fixes

  1. Restart Voice Connection: Have the bot leave and rejoin the voice channel
  2. Check Permissions: Ensure the bot has "Connect" and "Speak" permissions
  3. Update Audio Drivers: Ensure Discord has proper audio device access
  4. Clear Audio Cache: Clear temporary audio files if experiencing issues
  5. Check Quota: Verify you haven't exceeded daily/monthly voice processing limits
Debug Mode: Enable voice debug mode in bot settings to get detailed logs of voice processing steps.