PAL: Probing Audio Encoders via LLMs

A Study of Information Transfer from Audio Encoders to LLMs

1 Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
2 Surrey Institute for People-Centred AI (PAI), University of Surrey, UK
3 Mohamed bin Zayed University of AI (MBZUAI), Abu Dhabi, UAE
*Work done during internship at MBZUAI.
Research Question Overview

TL;DR: PAL investigates strategies for integrating audio encoders with LLMs, focusing on efficient cross-modal information transfer, guided by hypotheses derived from mechanistic interpretability studies and the operational principles of LLMs.

The Core Question: How Can an LLM Best Probe Audio?

  • 🧠 LLM Probing Audio Representations:
    We view audio-LLM interaction as more than passive input processing: it is an active process in which the LLM probes the audio representation to answer the user's query, e.g., "Which animal sound is heard in the audio?"
  • 🔍 A Deeper Inquiry, Guided by Principles:
    Instead of simply connecting an audio encoder to an LLM, we ask a more fundamental question, guided by mechanistic interpretability and the operational principles of LLMs: how can we design this connection so the LLM can actively and efficiently probe audio representations for relevant information? To answer this, we derive three architectural hypotheses that guide our design choices.
  • 🧪 Fair and Rigorous Architectural Comparisons:
    We evaluate each architectural hypothesis under a standardized three-stage training curriculum, keeping the (publicly available) data, batch size, number of epochs, and overall training design identical across comparisons.

Key Findings

PAL Hypotheses Diagram

Our systematic investigation led to three key findings that significantly enhance audio-LLM performance, yielding improvements of 10–60% over the baseline.

  1. Delayed Fusion: Introducing audio representations only after the initial text-only processing layers lets the LLM establish a textual context first. The LLM can then query the audio representations for relevant information more effectively, improving overall performance compared to early integration (see the first sketch after this list).
  2. Attention-Only Processing: Injecting audio information exclusively through an LLM layer's attention submodule, without propagating it to that layer's FFN submodule, is sufficient for the LLM to probe audio representations effectively and to engage the relevant pathways during FFN processing to satisfy textual queries. This maintains or improves performance and may preserve the FFN's pre-trained linguistic knowledge while still enabling cross-modal information transfer.
  3. Multi-Encoder Ensembling: Ensembles of audio encoders broaden the range of audio information the LLM can query. A diverse ensemble, combining self-supervised encoders for fine-grained acoustic detail, language-aligned encoders for alignment with text, and domain-specific encoders for areas such as music and speech, provides richer, complementary representations. Notably, we demonstrate an integration mechanism for such ensembles that avoids increasing the LLM's input token count, improving overall performance (see the second sketch after this list).
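
The first two findings can be made concrete with a small PyTorch sketch. This is our own illustrative reconstruction, not the PAL implementation: the class names (`AttentionOnlyAudioLayer`, `DelayedFusionStack`), the dimensions, the use of `nn.MultiheadAttention`, and the exact point at which the audio contribution re-enters the residual stream are all assumptions made for clarity.

```python
# Minimal sketch (not the authors' code) of findings 1 and 2.
# All module and argument names are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionOnlyAudioLayer(nn.Module):
    """Decoder layer that injects audio via attention only.

    The audio contribution is added to the residual stream after the FFN,
    so the FFN itself never processes audio-derived features (finding 2).
    """

    def __init__(self, d_model=2048, n_heads=32, d_audio=1024, fuse_audio=True):
        super().__init__()
        self.fuse_audio = fuse_audio
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_proj = nn.Linear(d_audio, d_model)  # map encoder dim to LLM dim
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, audio=None, attn_mask=None):
        # Standard self-attention over text tokens (a causal mask would be
        # supplied via attn_mask during training/inference).
        h = self.norm1(x)
        sa, _ = self.self_attn(h, h, h, attn_mask=attn_mask)
        x = x + sa

        # Audio enters only through cross-attention; note that it is NOT
        # folded into the FFN input below.
        audio_ctx = 0.0
        if self.fuse_audio and audio is not None:
            a = self.audio_proj(audio)
            audio_ctx, _ = self.cross_attn(self.norm2(x), a, a)

        # The FFN sees only the text-side residual stream.
        ffn_out = self.ffn(self.norm3(x))
        return x + ffn_out + audio_ctx


class DelayedFusionStack(nn.Module):
    """Stack whose first `fusion_start` layers are text-only (finding 1)."""

    def __init__(self, n_layers=16, fusion_start=4, **kw):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionOnlyAudioLayer(fuse_audio=(i >= fusion_start), **kw)
            for i in range(n_layers)
        )

    def forward(self, x, audio, attn_mask=None):
        for layer in self.layers:
            x = layer(x, audio=audio, attn_mask=attn_mask)
        return x
```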
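
Finding 3 can be sketched in the same hedged spirit. The class `MultiEncoderAudioBank`, the encoder dimensions, and fusion by concatenation along the key/value axis are illustrative choices rather than the exact PAL mechanism; the point is that the ensemble output is consumed only as attention keys/values, so the LLM's input token count does not grow with the number of encoders.

```python
# Minimal sketch (assumed, not the released code) of finding 3: an ensemble of
# audio encoders feeds the cross-attention as keys/values, leaving the LLM's
# input token count unchanged. Encoder dimensions are placeholders.
import torch
import torch.nn as nn


class MultiEncoderAudioBank(nn.Module):
    """Fuse several audio encoders into one key/value bank for cross-attention."""

    def __init__(self, encoder_dims, d_model=2048):
        super().__init__()
        # One lightweight projection per encoder (e.g. self-supervised,
        # language-aligned, and domain-specific encoders).
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)

    def forward(self, encoder_outputs):
        # encoder_outputs: list of (batch, time_i, dim_i) tensors, one per encoder.
        banks = [proj(feats) for proj, feats in zip(self.projs, encoder_outputs)]
        # Concatenating along the time axis only enlarges the attention
        # key/value bank; the text token sequence fed to the LLM is untouched.
        return torch.cat(banks, dim=1)


if __name__ == "__main__":
    bank = MultiEncoderAudioBank(encoder_dims=[768, 1024, 512])
    outs = [torch.randn(2, 50, 768), torch.randn(2, 25, 1024), torch.randn(2, 100, 512)]
    kv = bank(outs)  # (2, 175, 2048) key/value bank
    print(kv.shape)
```

In this sketch, the returned bank would play the role of the `audio` argument in the delayed-fusion stack above.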

Sample Model Outputs

We compare PAL against the state-of-the-art audio-LLMs LTU and GAMA. These models are trained on a comparable amount of data but use larger LLMs (7B vs. PAL's 1B) and more training epochs (6 vs. PAL's 3). Demo samples are taken from the AVE and AVS datasets.

Sample 1: Goat Bleating

Question: Analyze this audio clip. Describe the sounds you hear in as much detail as possible. What does the overall soundscape suggest about the environment or the activity taking place?

PAL (Llama3.2-1B)(Ours): The audio clip features a single sound event - a goat bleating. The sound is high pitched and nasal, which suggests that the goat is likely young or small. The sound is also relatively short, which suggest that the goat is not in distress or danger. The overall soundscape suggests that the environment is relatively quiet and peaceful, with no other significant sounds or disturbances.
GAMA (Llama2-7B): The sounds of the goat and the sheep suggest that they are in a rural or agricultural environment. The sounds of the machinery suggest that there is some kind of activity or work being done in the area. The overall soundscape suggests a busy and active environment with multiple sources of noise and activity.
LTU (Vicuna1.5-7B): The overall soundscape suggests that there are sheep in the vicinity, and that they are likely grazing or being tended to by a shepherd. The bleating of the sheep is a common sound in rural environments, and the high-pitched and nasal quality of the sound suggests that the sheep are in distress or seeking attention.

Training Details

Language Model: We used Llama 3.2-1B for the experiments reported in the paper.

Training Data: To ensure fair and rigorous comparisons, all experiments use the publicly available OpenAQA-5M dataset introduced by LTU, which comprises approximately 5.6 million audio-text pairs.

Training Curriculum Overview

Training Curriculum Diagram

Acknowledgements

This research was supported by the EPSRC-BBC Prosperity Partnership ‘AI4ME: Future Personalised Object-Based Media Experiences Delivered at Scale Anywhere’ (EP/V038087/1).