Friendify's Voice System lets your Discord bots transcribe spoken messages and reply with natural-sounding speech.
Key Features
Advanced Speech Recognition: Multiple Whisper models for accurate transcription
High-Quality TTS: Natural-sounding text-to-speech in multiple languages
Real-time Processing: Low-latency voice processing for responsive interactions
Noise Filtering: Automatic background noise reduction and audio enhancement
Multi-language Support: Turkish and English as primary languages, plus German, French, Spanish, Italian, Portuguese, and Russian
Smart Audio Detection: Automatic detection of speech vs. background noise
Performance Optimized: Our voice system is optimized for Discord's voice infrastructure with minimal latency and high reliability.
Voice Processing Pipeline
Understanding how voice is processed helps you optimize your bot's voice interactions.
Audio Capture: Capture voice from Discord channels
Audio Processing: Noise reduction and enhancement
Transcription: Convert speech to text using Whisper
AI Response: Generate the AI response and speak it
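One way to picture the four stages is as a chain of functions, each taking the output of the previous one. The sketch below is illustrative only: `Utterance`, `capture`, `enhance`, `transcribe`, `respond`, and `run_pipeline` are hypothetical names with stubbed bodies, not Friendify APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Utterance:
    """Hypothetical container handed from one pipeline stage to the next."""
    pcm: bytes = b""
    text: str = ""
    reply: str = ""
    trace: List[str] = field(default_factory=list)


Stage = Callable[[Utterance], Utterance]


def capture(u: Utterance) -> Utterance:
    # Stand-in for receiving audio from a Discord voice channel.
    u.trace.append("capture")
    return u


def enhance(u: Utterance) -> Utterance:
    # Stand-in for noise reduction and enhancement.
    u.trace.append("enhance")
    return u


def transcribe(u: Utterance) -> Utterance:
    # Stand-in for a Whisper transcription call (fixed text for the demo).
    u.text = "hello bot"
    u.trace.append("transcribe")
    return u


def respond(u: Utterance) -> Utterance:
    # Stand-in for generating a reply and handing it to TTS.
    u.reply = f"You said: {u.text}"
    u.trace.append("respond")
    return u


def run_pipeline(u: Utterance, stages: List[Stage]) -> Utterance:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        u = stage(u)
    return u


PIPELINE: List[Stage] = [capture, enhance, transcribe, respond]
result = run_pipeline(Utterance(pcm=b"\x00\x00"), PIPELINE)
```

Keeping the stages as separate callables makes it easy to time each one individually when looking for the source of latency.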
Audio Quality Enhancement
Noise Suppression: Remove background noise and echo
Volume Normalization: Automatic gain control for consistent levels
Frequency Filtering: Optimize audio for speech recognition
Format Conversion: Support for various audio formats and codecs
Audio Processing Settings:
- Sample Rate: 16 kHz (optimized for speech)
- Bit Depth: 16-bit
- Channels: Mono (converted from stereo if needed)
- Format: WAV/PCM for processing
- Noise Gate: -40 dB threshold
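To make these settings concrete, here is a minimal sketch of a stereo-to-mono downmix and a level check against the noise gate, using only the Python standard library. It assumes the -40 dB threshold refers to the RMS level relative to 16-bit full scale (dBFS), which the settings list does not state; all function names are hypothetical.

```python
import math
import struct
from typing import List

SAMPLE_RATE_HZ = 16_000    # from the settings list
NOISE_GATE_DB = -40.0      # from the settings list; assumed to mean dBFS (RMS)
FULL_SCALE = 32768.0       # magnitude of the most negative 16-bit sample


def stereo_to_mono(interleaved: List[int]) -> List[int]:
    """Average each left/right pair of an interleaved stereo stream."""
    return [
        (interleaved[i] + interleaved[i + 1]) // 2
        for i in range(0, len(interleaved) - 1, 2)
    ]


def rms_dbfs(samples: List[int]) -> float:
    """RMS level of 16-bit samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20.0 * math.log10(math.sqrt(mean_square) / FULL_SCALE)


def passes_gate(samples: List[int], threshold_db: float = NOISE_GATE_DB) -> bool:
    """True when the frame is at or above the gate threshold."""
    return rms_dbfs(samples) >= threshold_db


def to_pcm16le(samples: List[int]) -> bytes:
    """Pack samples as little-endian signed 16-bit PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


# 20 ms frames at 16 kHz contain 320 samples.
quiet_frame = [100, -100] * 160     # low-level hum, well under the gate
loud_frame = [8000, -8000] * 160    # speech-level signal, above the gate
```

With these frames, `passes_gate(quiet_frame)` is false and `passes_gate(loud_frame)` is true, so only the second frame would be forwarded to transcription.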
Speech-to-Text (Whisper)
Friendify uses OpenAI's Whisper models for accurate speech transcription with multiple model options for different use cases.
Whisper Tiny: ~10x realtime, 39 MB. Fast, basic accuracy.
Whisper Base: ~7x realtime, 74 MB. Balanced speed/quality.
Whisper Small: ~4x realtime, 244 MB. High accuracy (recommended).
Online Whisper: ~2x realtime, API-based. Highest accuracy.
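As a rough worked example of what the speed figures imply, the sketch below reads "~Nx realtime" as "N seconds of audio processed per second of compute", which is an interpretation rather than something the list states. Under that reading, a 30-second clip would take about 3 seconds with Whisper Tiny and about 7.5 seconds with Whisper Small. The dictionary keys and function name are hypothetical.

```python
# Speed factors copied from the model list ("~Nx realtime").
REALTIME_FACTOR = {
    "Whisper Tiny": 10.0,
    "Whisper Base": 7.0,
    "Whisper Small": 4.0,
    "Online Whisper": 2.0,
}


def estimated_seconds(model: str, clip_seconds: float) -> float:
    """Approximate transcription time: clip length divided by speed factor."""
    return clip_seconds / REALTIME_FACTOR[model]


for name in REALTIME_FACTOR:
    print(f"{name}: ~{estimated_seconds(name, 30.0):.1f} s for a 30 s clip")
```

The estimate ignores model load time and, for the API-based option, any network overhead not already included in the quoted figure.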
Transcription Features
Multi-language Detection: Automatic language identification
Punctuation and Capitalization: Proper text formatting
Speaker Recognition: Distinguish between different speakers
Timestamp Generation: Word-level timing information
Confidence Scoring: Quality assessment of transcriptions
Model Selection Strategy
Friendify automatically selects the best model based on:
Audio Quality: Higher quality audio uses more accurate models
Length: Shorter clips use faster models for responsiveness
Language: Some models perform better for specific languages
User Preferences: Manual model selection in settings
Fallback Chain: Automatic fallback if primary model fails
Smart Fallback: If the primary model fails or produces low-confidence results, the system automatically tries alternative models.
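The fallback behavior described here can be sketched as a loop over an ordered chain of models. This is an assumption-laden illustration, not Friendify's implementation: the `Transcriber` signature, the `transcribe_with_fallback` function, and the 0.6 confidence cutoff are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: a transcriber returns (text, confidence in [0, 1]).
Transcriber = Callable[[bytes], Tuple[str, float]]
Result = Tuple[Optional[str], str, float]  # (model name, text, confidence)


def transcribe_with_fallback(
    audio: bytes,
    chain: List[Tuple[str, Transcriber]],
    min_confidence: float = 0.6,  # assumed cutoff, not a documented value
) -> Result:
    """Try each model in order and return the first confident result.

    If no model reaches min_confidence, return the best low-confidence
    result seen, or (None, "", 0.0) when every model failed.
    """
    best: Result = (None, "", 0.0)
    for name, model in chain:
        try:
            text, confidence = model(audio)
        except Exception:
            continue  # this model failed; move on to the next one
        if confidence >= min_confidence:
            return name, text, confidence
        if best[0] is None or confidence > best[2]:
            best = (name, text, confidence)
    return best


# Stub models standing in for real transcription backends.
def failing(_: bytes) -> Tuple[str, float]:
    raise RuntimeError("model unavailable")


def unsure(_: bytes) -> Tuple[str, float]:
    return "helo wrld", 0.35


def confident(_: bytes) -> Tuple[str, float]:
    return "hello world", 0.92


chosen = transcribe_with_fallback(
    b"", [("small", failing), ("base", unsure), ("online", confident)]
)
```

Here the first model raises, the second returns a low-confidence result, and the third is accepted, so `chosen` names the `"online"` entry.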
Text-to-Speech (TTS)
Generate natural-sounding speech responses with various voice options and customization settings.
Neural Emma (EN): Female, American English. Clear pronunciation, natural intonation, professional tone.
Neural Ryan (EN): Male, American English. Deep, warm voice; confident delivery; versatile tone.
Neural Emel (TR): Female, Turkish. Native Turkish speaker, expressive delivery, cultural awareness.
Neural Ahmet (TR): Male, Turkish. Clear articulation, friendly tone, natural accent.
TTS Customization
Speech Rate: Adjust speaking speed (0.5x to 2.0x)
Pitch Control: Modify voice pitch (-50% to +50%)
Volume: Set output volume level
Emphasis: Add stress to specific words
Pauses: Insert custom pauses and breaks
Pronunciation: Custom pronunciation for specific terms
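If you generate SSML programmatically, the documented ranges for speech rate and pitch can be enforced before the markup is built. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; `build_ssml` and `clamp` are hypothetical helpers, and whether Friendify accepts out-of-range values or rejects them is not stated on this page.

```python
import xml.etree.ElementTree as ET


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def build_ssml(text: str, rate: float = 1.0, pitch_percent: float = 0.0) -> str:
    """Wrap text in <speak><prosody>, clamping to the documented ranges."""
    rate = clamp(rate, 0.5, 2.0)                       # 0.5x to 2.0x
    pitch_percent = clamp(pitch_percent, -50.0, 50.0)  # -50% to +50%

    speak = ET.Element("speak")
    prosody = ET.SubElement(
        speak,
        "prosody",
        {"rate": f"{rate:g}", "pitch": f"{pitch_percent:+g}%"},
    )
    prosody.text = text  # ElementTree escapes &, <, and > on output
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Merhaba!", rate=1.2, pitch_percent=10)
too_fast = build_ssml("Merhaba!", rate=5.0, pitch_percent=-80)
```

Building the markup through an XML library rather than string concatenation means user-supplied text containing characters such as `&` cannot break the document.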
SSML Example:
<speak>
  <prosody rate="1.2" pitch="+10%">
    Merhaba! <break time="500ms"/>
    Bugün <emphasis level="strong">harika</emphasis> bir gün!
  </prosody>
</speak>
Audio Output Options
Bitrate: 128, 192, or 320 kbps
Format: MP3, WAV, or OGG output formats
Stereo/Mono: Channel configuration options
Normalization: Automatic volume leveling
Compression: Dynamic range compression for Discord
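To get a feel for the bitrate options, the arithmetic below estimates the size of an encoded response. It assumes constant-bitrate encoding with 1 kbps = 1000 bits per second and ignores container overhead; the function name is hypothetical.

```python
def encoded_size_bytes(bitrate_kbps: int, duration_seconds: float) -> int:
    """Approximate size of a constant-bitrate stream, overhead ignored."""
    bits = bitrate_kbps * 1000 * duration_seconds
    return int(bits / 8)


for kbps in (128, 192, 320):
    size_kb = encoded_size_bytes(kbps, 10.0) / 1000
    print(f"{kbps} kbps, 10 s of speech: ~{size_kb:.0f} kB")
```

A 10-second reply works out to about 160 kB at 128 kbps and about 400 kB at 320 kbps.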
Voice Settings and Configuration
Customize your bot's voice behavior through comprehensive settings in the dashboard.
Auto Voice Detection (Enabled): Automatically join voice channels when users speak
Voice Response (Enabled): Respond with voice instead of text when in voice channels
Noise Gate Threshold: Minimum audio level to trigger transcription
Maximum Recording Length: Maximum duration for voice recordings
Preferred Whisper Model: Default transcription model for voice processing
TTS Voice: Default text-to-speech voice for responses
Performance Impact: Higher quality settings may increase processing time and resource usage. Test different configurations to find the optimal balance.
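When experimenting with configurations, it can help to model the settings as a validated object so that invalid combinations fail early. The sketch below is hypothetical throughout: the `VoiceSettings` class and its field names are not a Friendify API, the 30-second recording limit is an invented placeholder, and the other defaults simply mirror values that appear elsewhere on this page.

```python
from dataclasses import dataclass

# Names taken from the model and voice lists on this page.
WHISPER_MODELS = ("Whisper Tiny", "Whisper Base", "Whisper Small", "Online Whisper")
TTS_VOICES = ("Neural Emma", "Neural Ryan", "Neural Emel", "Neural Ahmet")


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical mirror of the dashboard voice settings."""
    auto_voice_detection: bool = True      # shown as Enabled
    voice_response: bool = True            # shown as Enabled
    noise_gate_db: float = -40.0           # from the audio processing settings
    max_recording_seconds: int = 30        # placeholder; no value is documented
    whisper_model: str = "Whisper Small"   # the recommended model
    tts_voice: str = "Neural Emma"

    def __post_init__(self) -> None:
        # Reject values that the dashboard would not offer.
        if self.noise_gate_db > 0:
            raise ValueError("noise gate threshold must be 0 dB or lower")
        if self.max_recording_seconds <= 0:
            raise ValueError("maximum recording length must be positive")
        if self.whisper_model not in WHISPER_MODELS:
            raise ValueError(f"unknown Whisper model: {self.whisper_model}")
        if self.tts_voice not in TTS_VOICES:
            raise ValueError(f"unknown TTS voice: {self.tts_voice}")


defaults = VoiceSettings()
turkish_fast = VoiceSettings(whisper_model="Whisper Tiny", tts_voice="Neural Emel")
```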
Multi-language Support
Friendify's voice system supports multiple languages with automatic detection and appropriate processing.
Supported Languages
Primary Support
Turkish (TR)
English (EN)
Additional Languages
German (DE)
French (FR)
Spanish (ES)
Italian (IT)
Portuguese (PT)
Russian (RU)
Language Detection
Automatic Detection: Whisper automatically identifies the spoken language
Confidence Scoring: Each detection includes a confidence level
Manual Override: Force specific language processing if needed
Mixed Language: Handle conversations with multiple languages
Fallback Languages: Preferred language order for ambiguous audio
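The interaction between detection confidence, manual override, and fallback order might look like the following. This is a sketch under stated assumptions: `choose_language`, the 0.5 confidence cutoff, and the default fallback order are hypothetical, and the scores are invented inputs rather than real Whisper output.

```python
from typing import Dict, Optional, Sequence


def choose_language(
    scores: Dict[str, float],
    fallback_order: Sequence[str] = ("tr", "en"),  # assumed preference order
    override: Optional[str] = None,
    min_confidence: float = 0.5,                   # assumed cutoff
) -> str:
    """Pick a language code from per-language detection confidences."""
    # Manual override always wins.
    if override is not None:
        return override

    # Accept the top detection when it is confident enough.
    if scores:
        best = max(scores, key=lambda code: scores[code])
        if scores[best] >= min_confidence:
            return best

    # Ambiguous audio: prefer the first fallback language that was detected
    # at all, otherwise the first fallback language outright.
    for code in fallback_order:
        if code in scores:
            return code
    return fallback_order[0]


clear = choose_language({"de": 0.9, "en": 0.05})
ambiguous = choose_language({"en": 0.4, "de": 0.35})
forced = choose_language({"de": 0.9}, override="fr")
```

In this example `clear` resolves to German, `ambiguous` falls back to English because it is the first preferred language among the candidates, and `forced` ignores detection entirely.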
Language Adaptation: The bot can adapt its personality and response style based on the detected language and cultural context.
Cross-language Features
Translation: Automatic translation between supported languages