# ACL-25 SpeechIQ Leaderboard
🎯 Welcome to the SpeechIQ Leaderboard!
This leaderboard presents evaluation results for voice understanding large language models (LLMVoice) under our novel SpeechIQ evaluation framework.
SpeechIQ is a human-cognition-inspired evaluation pipeline that assesses voice understanding ability across three cognitive levels drawn from Bloom's Taxonomy:
- 🧠 Remembering: Verbatim accuracy (WER-based)
- 💡 Understanding: Similarity of the LLM's interpretations
- Application: QA accuracy on downstream tasks
The Speech IQ Score provides a unified metric for comparing both cascaded methods (ASR+LLM) and end-to-end models.
| Rank | Speech IQ | Remember | Understand | Apply | Model Type | Setup | Audio Encoder |
|---|---|---|---|---|---|---|---|
| 🥇 | 108.64 | -1.885 | -1.604 | -1.146 | Agentic: ASR + GER + LLM | Whisper_v2-1.5B + GPT-4o + Qwen2_7B | OWSM-CTC_v3.1-1B |
## Column Explanations
- Rank: Position ranking, with 🥇🥈🥉 medals for the top three performers
- Speech IQ: Overall intelligence quotient combining all dimensions (primary metric)
- Remember: Verbatim accuracy score (WER-based)
- Understand: Semantic interpretation similarity score
- Apply: Downstream task performance score
- Model Type: Architecture approach (Agentic vs End2End)
- Setup: Specific model configuration and components
- Audio Encoder: The audio processing component used
Higher scores indicate better performance across all metrics.
## Model Type Comparison
Agentic Models (ASR + LLM):
- Count: 5
- Average Speech IQ: 106.68
- Best Score: 108.64
End-to-End Models:
- Count: 8
- Average Speech IQ: 96.16
- Best Score: 107.85
## 🎯 Cognitive Dimension Analysis
Remember (Verbatim Accuracy):
- Best performer: Canary_1B + Qwen2_7B (0.559)
Understand (Semantic Similarity):
- Best performer: Gemini-1.5-flash (0.641)
Apply (Task Performance):
- Best performer: Gemini-1.5-pro (0.710)
## Leaderboard Statistics

| Metric | Value |
|---|---|
| Top Performer | Whisper_v2-1.5B + GPT-4o + Qwen2_7B |
| 🎯 Highest Score | 108.64 |
| 🤖 Best Agentic Model | 108.64 |
| Best End2End Model | 107.85 |
| Total Models | 13 |
## About SpeechIQ Evaluation
Speech Intelligence Quotient (SpeechIQ) represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks. Our framework moves beyond traditional metrics like Word Error Rate (WER) to provide comprehensive evaluation of voice understanding capabilities.
### 🎯 Evaluation Framework
SpeechIQ evaluates models across three cognitive dimensions inspired by Bloom's Taxonomy:
- Remember (Verbatim Accuracy): Tests the model's ability to accurately capture spoken content
- Understand (Interpretation Similarity): Evaluates how well the model comprehends the meaning of speech
- Apply (Downstream Performance): Measures the model's ability to use speech understanding for practical tasks
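The Remember dimension is grounded in word error rate. As a minimal, self-contained sketch of how WER is conventionally computed (word-level Levenshtein distance; this is the standard metric, not the SpeechIQ codebase itself):

```python
# Word error rate via word-level Levenshtein distance: the number of
# substitutions, insertions, and deletions needed to turn the hypothesis
# into the reference, divided by the reference length. Illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                dp[i - 1][j] + 1,                               # deletion
                dp[i][j - 1] + 1,                               # insertion
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Lower WER corresponds to better verbatim accuracy on the Remember dimension.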
### Model Categories
- Agentic (ASR + LLM): Cascaded approaches using separate ASR and LLM components
- End2End: Direct speech-to-text models that process audio end-to-end
### 🔬 Key Benefits
- Unified Comparisons: Compare cascaded and end-to-end approaches on equal footing
- Error Detection: Identify annotation errors in existing benchmarks
- Hallucination Detection: Detect and quantify hallucinations in voice LLMs
- Cognitive Assessment: Map model capabilities to human cognitive principles
### Speech IQ Score
The final Speech IQ Score combines performance across all three dimensions to provide a comprehensive measure of voice understanding intelligence.
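The exact combination is defined in the paper. Purely as an illustration of the idea (an assumption, not the published formula), per-dimension scores can be z-normalized across models and mapped onto an IQ-style scale with mean 100:

```python
# Hedged sketch: z-normalize each dimension across all models, average the
# z-scores, and rescale to an IQ-like scale (mean 100, std 15). This is an
# illustrative assumption, not the formula from the SpeechIQ paper.
from statistics import mean, pstdev

def speech_iq(model_scores, all_scores):
    """model_scores: (remember, understand, apply) for one model;
    all_scores: the same triples for every model being compared."""
    z_total = 0.0
    for dim in range(3):
        column = [scores[dim] for scores in all_scores]
        mu, sigma = mean(column), pstdev(column)
        z_total += (model_scores[dim] - mu) / sigma
    return 100 + 15 * (z_total / 3)

# Toy inputs, not leaderboard data:
scores = [(0.30, 0.40, 0.50), (0.45, 0.55, 0.60), (0.60, 0.70, 0.70)]
print(round(speech_iq(scores[2], scores), 2))  # best model lands above 100
```

An IQ-style rescaling of this kind is what lets cascaded and end-to-end systems be compared on a single axis regardless of each dimension's native units.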
### Reproducibility
For detailed methodology and reproduction instructions, please refer to our paper and codebase.
## Submit Your Model for SpeechIQ Evaluation
To submit your voice understanding model for SpeechIQ evaluation:
1) Ensure Model Compatibility
Make sure your model can process audio inputs and generate text outputs in one of these formats:
- ASR + LLM: Separate ASR and LLM components
- End-to-End: Direct audio-to-text processing
2) Model Requirements
- Model must be publicly accessible
- Provide clear documentation of audio input format and expected outputs
- Include information about audio encoder specifications
3) Evaluation Domains
Your model will be evaluated across:
- Remember: Transcription accuracy
- Understand: Semantic understanding
- Apply: Task-specific performance
4) Documentation
Please provide:
- Model architecture details
- Training data information
- Audio preprocessing requirements
- Expected input/output formats
## 📧 Contact
For questions about SpeechIQ evaluation or to submit your model, please contact the research team.