PAL: Probing Audio Encoders via LLMs

Audio Information Transfer into LLMs

¹Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK · ²Surrey Institute for People Centered AI, University of Surrey, UK · ³Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
PAL Overview

TL;DR: SOTA audio-LLMs integrate audio via PLITS (Prepend to the LLM's Input Token Space). We study how to transfer rich audio semantics from encoders to LLMs efficiently, introduce LAL, an attention-only integration that injects audio as keys/values so audio never traverses the FFNs, and present PAL, an encoder-aware hybrid that uses PLITS for Whisper (speech) and LAL for general audio encoders.

Contributions

  • 💡 LAL (Lightweight Audio LLM Integration): A minimal-compute path that injects audio only as attention keys/values while keeping queries text-only and bypassing FFNs for audio. This changes the forward path (unlike LoRA/PEFT) so efficiency gains persist at inference.
  • 🎶 PAL (Encoder-Aware Hybrid): A practical system that applies PLITS to Whisper (speech benefits from in-LLM decoding) and LAL to general audio encoders (e.g., SSLAM, CLAP), achieving strong quality across speech, music, and general audio with substantially improved efficiency.
  • 📊 Fair Comparisons & Training Protocol: A standardized curriculum and shared data/hyperparameters enable controlled PLITS vs. LAL vs. PAL comparisons; ablations include a frozen-FFN variant demonstrating LAL's effectiveness without updating FFNs.
Figure 1. PAL architecture and routing (PLITS for Whisper, LAL for general audio encoders).

LAL: Lightweight Audio LLM Integration

  • ⚙️ Mechanism: Text-only queries attend over concatenated text+audio keys/values; audio never forms queries and never traverses FFNs (see the sketch after this list).
  • 📉 Complexity: PLITS attention is O((Na+Nt)^2). LAL reduces this to O((Na+Nt)·Nt), removing the Na^2 term; FFN compute/memory for audio is eliminated.
  • 🚀 Efficiency: In our setup we observed up to 64.1% lower memory and up to 247.5% higher training throughput vs. PLITS, with similar or better task accuracy.
  • 🧠 Contextual and Parametric Knowledge: Attention supplies contextual cues that modulate the text tokens, which then engage the LLM's parametric knowledge inside the FFNs; there is no need to push audio through the FFNs.
  • 🏗️ Not PEFT/LoRA: LAL is an architectural routing change. LoRA/PEFT adapts weights but keeps the same forward path; LAL’s gains persist at inference.
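
A minimal PyTorch sketch of this routing is below. It assumes a standard pre-norm decoder block; `LALBlock`, its layer sizes, and the toy shapes are illustrative placeholders, not the released PAL implementation.

```python
import torch
import torch.nn as nn

class LALBlock(nn.Module):
    """Illustrative pre-norm decoder block with LAL-style audio injection.

    Text-only queries attend over concatenated text+audio keys/values; audio
    states never form queries and never pass through the FFN. Names and sizes
    are placeholders, not the released PAL code.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # `audio` is assumed to be already projected to the LLM hidden width.
        nt, na = text.shape[1], audio.shape[1]
        q = self.norm1(text)                   # queries: text tokens only
        kv = torch.cat([q, audio], dim=1)      # keys/values: text + audio
        # Causal mask over text-to-text attention; audio keys stay fully visible.
        causal = torch.ones(nt, nt).triu(diagonal=1).bool()
        mask = torch.cat([causal, torch.zeros(nt, na).bool()], dim=1)
        attn_out, _ = self.attn(q, kv, kv, attn_mask=mask, need_weights=False)
        text = text + attn_out
        # FFN runs on text tokens only: audio is bypassed entirely.
        return text + self.ffn(self.norm2(text))

# Toy usage: Nt = 16 text tokens, Na = 200 projected audio tokens.
block = LALBlock(d_model=512, n_heads=8)
out = block(torch.randn(2, 16, 512), torch.randn(2, 200, 512))
print(out.shape)  # torch.Size([2, 16, 512]) -- only text states are updated
```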
Figure 2. Detailed LAL vs. PLITS comparison (1×4 layout).

PLITS vs. LAL (Quick Comparison)

| Aspect | PLITS | LAL |
| --- | --- | --- |
| How audio is added | Prepended as tokens to the text; processed by every layer | Injected as K/V only; queries are text-only |
| Attention complexity | O((Na+Nt)^2) | O((Na+Nt)·Nt) |
| Audio through FFN | Yes (all layers) | No (bypassed) |
| Memory & FLOPs | Higher with longer audio (Na) | Lower; attention scales linearly in Na and FFNs skip audio |
| Knowledge usage | Contextual + parametric (audio and text both pass through full blocks) | Contextual (attention) → activates parametric (FFN on text) |
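
To make the complexity row concrete, here is a back-of-the-envelope comparison that counts only query-key score entries per attention layer; the `Na`/`Nt` values are illustrative and are not taken from the paper's configuration.

```python
# Attention score-matrix sizes per layer (query-key pairs only; ignores
# projections, heads, and FFN cost). Token counts below are illustrative.
def plits_pairs(na: int, nt: int) -> int:
    # PLITS: audio + text tokens all act as both queries and keys.
    return (na + nt) ** 2

def lal_pairs(na: int, nt: int) -> int:
    # LAL: only text tokens act as queries; keys span text + audio.
    return nt * (na + nt)

na, nt = 1500, 128          # e.g. a long audio clip vs. a short text prompt
print(plits_pairs(na, nt))  # 2,650,384 query-key pairs
print(lal_pairs(na, nt))    # 208,384 query-key pairs (~12.7x fewer)
```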

PAL: Encoder-Aware Hybrid

PAL selects the best path per encoder: PLITS for Whisper (speech benefits from in-LLM decoding) and LAL for general audio encoders (e.g., SSLAM, CLAP). This keeps quality on speech, music, and general audio while substantially improving efficiency compared with a pure PLITS system.

  • 🗣️ Speech (Whisper): Use PLITS to leverage decoding dynamics inside the LLM.
  • 🎶 General audio/music (SSLAM/CLAP): Use LAL for linear-in-Na attention and no FFN overhead on audio.
  • ✅ Result: One model, encoder-aware routing, strong accuracy with lower memory and higher throughput (a minimal routing sketch follows below).
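
A minimal sketch of the encoder-aware routing is below; `route_audio` and the encoder names are illustrative placeholders rather than PAL's actual API, and in the full model the LAL-routed features would be consumed by attention layers like the `LALBlock` sketch above.

```python
import torch

def route_audio(encoder_name: str,
                audio_feats: torch.Tensor,
                text_embeds: torch.Tensor):
    """Illustrative encoder-aware routing (placeholder names, not PAL's API).

    Returns (llm_input_embeds, lal_kv): PLITS-routed features are prepended to
    the LLM's input embeddings; LAL-routed features are handed to the attention
    layers as extra keys/values and never enter the FFNs.
    """
    if encoder_name == "whisper":                    # speech: PLITS path
        llm_input = torch.cat([audio_feats, text_embeds], dim=1)
        return llm_input, None
    else:                                            # e.g. SSLAM / CLAP: LAL path
        return text_embeds, audio_feats

# Usage with dummy projected features (batch=1, hidden=512):
text = torch.randn(1, 16, 512)
speech = torch.randn(1, 300, 512)
music = torch.randn(1, 64, 512)

plits_in, _ = route_audio("whisper", speech, text)   # (1, 316, 512) token stream
lal_in, kv = route_audio("sslam", music, text)       # text-only stream + audio K/V
```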

Acknowledgements

This research was supported by the EPSRC and BBC Prosperity Partnership “AI4ME: Future Personalized Object Based Media Experiences Delivered at Scale Anywhere” (EP/V038087/1). Part of the experiments used resources provided by the EuroHPC Joint Undertaking, with access to the LEONARDO EuroHPC supercomputer hosted by CINECA, together with consortium resources.