# ACL-25 SpeechIQ Leaderboard
🎯 Welcome to the SpeechIQ Leaderboard!
This leaderboard presents evaluation results for voice understanding large language models (LLMVoice) under our novel SpeechIQ evaluation framework.
SpeechIQ is a human-cognition-inspired evaluation pipeline that assesses voice understanding ability across three cognitive levels drawn from Bloom's Taxonomy:
- 🧠 Remembering: Verbatim accuracy (WER-based)
- 💡 Understanding: Similarity of the LLM's interpretations
- Application: QA accuracy on downstream tasks
The Speech IQ Score provides a unified metric for comparing both cascaded methods (ASR+LLM) and end-to-end models.
| Rank | Speech IQ | Remember | Understand | Apply | Model Type | Setup | Audio Encoder |
|---|---|---|---|---|---|---|---|
| 🥇 | 108.64 | -1.885 | -1.604 | -1.146 | Agentic: ASR + GER + LLM | Whisper_v2-1.5B + GPT-4o + Qwen2_7B | OWSM-CTC_v3.1-1B |
## Column Explanations
- Rank: Position ranking, with 🥇🥈🥉 medals for the top three performers
- Speech IQ: Overall intelligence quotient combining all dimensions (primary metric)
- Remember: Verbatim accuracy score (WER-based)
- Understand: Semantic interpretation similarity score
- Apply: Downstream task performance score
- Model Type: Architecture approach (Agentic vs End2End)
- Setup: Specific model configuration and components
- Audio Encoder: The audio processing component used
Higher scores indicate better performance across all metrics.
## Model Type Comparison
Agentic Models (ASR + LLM):
- Count: 5
- Average Speech IQ: 106.68
- Best Score: 108.64
End-to-End Models:
- Count: 8
- Average Speech IQ: 96.16
- Best Score: 107.85
## 🎯 Cognitive Dimension Analysis
Remember (Verbatim Accuracy):
- Best performer: Canary_1B + Qwen2_7B (0.559)
Understand (Semantic Similarity):
- Best performer: Gemini-1.5-flash (0.641)
Apply (Task Performance):
- Best performer: Gemini-1.5-pro (0.710)
## Leaderboard Statistics

| Metric | Value |
|---|---|
| Top Performer | Whisper_v2-1.5B + GPT-4o + Qwen2_7B |
| 🎯 Highest Score | 108.64 |
| 🤖 Best Agentic Model | 108.64 |
| Best End2End Model | 107.85 |
| Total Models | 13 |
## About SpeechIQ Evaluation
Speech Intelligence Quotient (SpeechIQ) represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks. Our framework moves beyond traditional metrics like Word Error Rate (WER) to provide comprehensive evaluation of voice understanding capabilities.
### 🎯 Evaluation Framework
SpeechIQ evaluates models across three cognitive dimensions inspired by Bloom's Taxonomy:
- Remember (Verbatim Accuracy): Tests the model's ability to accurately capture spoken content
- Understand (Interpretation Similarity): Evaluates how well the model comprehends the meaning of speech
- Apply (Downstream Performance): Measures the model's ability to use speech understanding for practical tasks
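The Remember dimension is grounded in word error rate. As a minimal, self-contained sketch of how WER is conventionally computed (word-level Levenshtein distance; this is the standard metric, not the SpeechIQ codebase itself):

```python
# Word error rate via word-level Levenshtein distance: the number of
# substitutions, insertions, and deletions needed to turn the hypothesis
# into the reference, divided by the reference length. Illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                dp[i - 1][j] + 1,                               # deletion
                dp[i][j - 1] + 1,                               # insertion
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Lower WER corresponds to better verbatim accuracy on the Remember dimension.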
### Model Categories
- Agentic (ASR + LLM): Cascaded approaches using separate ASR and LLM components
- End2End: Direct speech-to-text models that process audio end-to-end
### 🔬 Key Benefits
- Unified Comparisons: Compare cascaded and end-to-end approaches on equal footing
- Error Detection: Identify annotation errors in existing benchmarks
- Hallucination Detection: Detect and quantify hallucinations in voice LLMs
- Cognitive Assessment: Map model capabilities to human cognitive principles
### Speech IQ Score
The final Speech IQ Score combines performance across all three dimensions to provide a comprehensive measure of voice understanding intelligence.
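The exact combination is defined in the paper. Purely as an illustration of the idea (an assumption, not the published formula), per-dimension scores can be z-normalized across models and mapped onto an IQ-style scale with mean 100:

```python
# Hedged sketch: z-normalize each dimension across all models, average the
# z-scores, and rescale to an IQ-like scale (mean 100, std 15). This is an
# illustrative assumption, not the formula from the SpeechIQ paper.
from statistics import mean, pstdev

def speech_iq(model_scores, all_scores):
    """model_scores: (remember, understand, apply) for one model;
    all_scores: the same triples for every model being compared."""
    z_total = 0.0
    for dim in range(3):
        column = [scores[dim] for scores in all_scores]
        mu, sigma = mean(column), pstdev(column)
        z_total += (model_scores[dim] - mu) / sigma
    return 100 + 15 * (z_total / 3)

# Toy inputs, not leaderboard data:
scores = [(0.30, 0.40, 0.50), (0.45, 0.55, 0.60), (0.60, 0.70, 0.70)]
print(round(speech_iq(scores[2], scores), 2))  # best model lands above 100
```

An IQ-style rescaling of this kind is what lets cascaded and end-to-end systems be compared on a single axis regardless of each dimension's native units.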
### Reproducibility
For detailed methodology and reproduction instructions, please refer to our paper and codebase.
## Submit Your Model for SpeechIQ Evaluation
To submit your voice understanding model for SpeechIQ evaluation:
1) Ensure Model Compatibility
Make sure your model can process audio inputs and generate text outputs in one of these formats:
- ASR + LLM: Separate ASR and LLM components
- End-to-End: Direct audio-to-text processing
2) Model Requirements
- Model must be publicly accessible
- Provide clear documentation of audio input format and expected outputs
- Include information about audio encoder specifications
3) Evaluation Domains
Your model will be evaluated across:
- Remember: Transcription accuracy
- Understand: Semantic understanding
- Apply: Task-specific performance
4) Documentation
Please provide:
- Model architecture details
- Training data information
- Audio preprocessing requirements
- Expected input/output formats
## 📧 Contact
For questions about SpeechIQ evaluation or to submit your model, please contact the research team.