๐ŸŽ™๏ธ ACL-25 SpeechIQ Leaderboard

๐ŸŽฏ Welcome to the SpeechIQ Leaderboard!

This leaderboard presents evaluation results for voice understanding large language models (LLMVoice) using our novel SpeechIQ evaluation framework.

SpeechIQ is a human cognition-inspired evaluation pipeline that assesses voice understanding abilities across three cognitive levels based on Bloom's Taxonomy:

  • ๐Ÿง  Remembering: Verbatim accuracy (WER-based)
  • ๐Ÿ’ก Understanding: Similarity of LLM's interpretations
  • ๐Ÿš€ Application: QA accuracy for downstream tasks

The Speech IQ Score provides a unified metric for comparing both cascaded methods (ASR+LLM) and end-to-end models.
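As a rough illustration of how per-dimension scores could be folded into a single IQ-style number, here is a hypothetical sketch: it z-scores each dimension across the model pool, averages the three z-scores, and maps the result to a mean-100, SD-15 scale. The actual SpeechIQ weighting and normalization may differ, and the demo scores below are made up:

```python
# Hypothetical aggregation sketch (NOT the official SpeechIQ formula):
# z-score each dimension across models, average, map to IQ scale.
import statistics

def speech_iq(scores: dict[str, tuple[float, float, float]]) -> dict[str, float]:
    dims = list(zip(*scores.values()))  # per-dimension columns
    means = [statistics.mean(d) for d in dims]
    sds = [statistics.pstdev(d) or 1.0 for d in dims]  # guard against zero spread
    iq = {}
    for model, vals in scores.items():
        z = [(v - m) / s for v, m, s in zip(vals, means, sds)]
        iq[model] = 100 + 15 * statistics.mean(z)
    return iq

# Fabricated (remember, understand, apply) scores, higher = better
demo = {
    "model_a": (0.10, 0.82, 0.74),
    "model_b": (0.22, 0.75, 0.69),
}
print(speech_iq(demo))
```

With two models each z-score collapses to ±1, so model_a (ahead on two of three dimensions) lands above 100 and model_b below it.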

๐Ÿฅ‡
108.64
-1.885
-1.604
-1.146
Agentic: ASR + GER + LLM
Whisper_v2-1.5B + GPT-4o + Qwen2_7B
OWSM-CTC_v3.1-1B

๐Ÿ“‹ Column Explanations

  • Rank: Position ranking with ๐Ÿฅ‡๐Ÿฅˆ๐Ÿฅ‰ medals for top 3 performers
  • Speech IQ: Overall intelligence quotient combining all dimensions (primary metric)
  • Remember: Verbatim accuracy score (WER-based)
  • Understand: Semantic interpretation similarity score
  • Apply: Downstream task performance score
  • Model Type: Architecture approach (Agentic vs End2End)
  • Setup: Specific model configuration and components
  • Audio Encoder: The audio processing component used

Higher scores indicate better performance across all metrics.

๐Ÿ“Š Leaderboard Statistics

Metric | Value
🏆 Top Performer | Whisper_v2-1.5B + GPT-4o + Qwen2_7B
🎯 Highest Score | 108.64
🤖 Best Agentic Model | 108.64
🔄 Best End2End Model | 107.85
📈 Total Models | 13