PAL: Probing Audio Encoders via LLMs

A Study of Information Transfer from Audio Encoders to LLMs

1 Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
2 Surrey Institute for People-Centred AI (PAI), University of Surrey, UK
3 Mohamed bin Zayed University of AI (MBZUAI), Abu Dhabi, UAE
*Work done during internship at MBZUAI.
Research Question Overview

TL;DR: PAL investigates strategies for integrating audio encoders with LLMs, focusing on efficient cross-modal information transfer, guided by hypotheses derived from mechanistic interpretability studies and the operational principles of LLMs.

The Core Question: How Can an LLM Best Probe Audio?

  • 🧠 LLM Probing Audio Representations:
    We view audio-LLM interaction as more than passive input processing: it is an active process in which the LLM probes the audio representation to answer the user's query, e.g., "Which animal sound is heard in the audio?"
  • 🔍 A Deeper Inquiry, Guided by Principles:
    Instead of simply connecting an audio encoder to an LLM, we ask a more fundamental question, guided by mechanistic interpretability and the operational principles of LLMs: how can we design this connection so the LLM can actively and efficiently probe audio representations for relevant information? To answer this, we derive three architectural hypotheses that guide our design choices.
  • 🧪 Fair and Rigorous Architectural Comparisons:
    We evaluate each architectural hypothesis under a standardized three-stage training curriculum, keeping the (publicly available) data, batch size, number of epochs, and overall training design identical across comparisons.

Key Findings

PAL Hypotheses Diagram

Our systematic investigation led to three key findings that significantly enhance audio-LLM performance, yielding improvements of 10–60% over the baseline.

  1. Delayed Fusion: Introducing audio representations only after the initial text-only processing layers lets the LLM establish a textual context first. The LLM can then query the audio representations for relevant information more effectively, improving overall performance compared to early integration (see the first sketch after this list).
  2. Attention-Only Processing: Injecting audio information exclusively through an LLM layer's attention submodule, without propagating it to that layer's FFN submodule, is sufficient for the LLM to probe audio representations effectively and to engage the relevant pathways during FFN processing to satisfy textual queries. This maintains or improves performance and may preserve the FFN's pre-trained linguistic knowledge while still enabling cross-modal information transfer.
  3. Multi-Encoder Ensembling: Ensembles of audio encoders broaden the range of audio information the LLM can query. A diverse ensemble, combining self-supervised encoders for fine-grained acoustic detail, language-aligned encoders for alignment with text, and domain-specific encoders for areas such as music and speech, provides richer, complementary representations. Notably, we demonstrate an integration mechanism for such ensembles that avoids increasing the LLM's input token count, improving overall performance (see the second sketch after this list).
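
The first two findings can be made concrete with a small PyTorch sketch. This is our own illustrative reconstruction, not the PAL implementation: the class names (`AttentionOnlyAudioLayer`, `DelayedFusionStack`), the dimensions, the use of `nn.MultiheadAttention`, and the exact point at which the audio contribution re-enters the residual stream are all assumptions made for clarity.

```python
# Minimal sketch (not the authors' code) of findings 1 and 2.
# All module and argument names are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionOnlyAudioLayer(nn.Module):
    """Decoder layer that injects audio via attention only.

    The audio contribution is added to the residual stream after the FFN,
    so the FFN itself never processes audio-derived features (finding 2).
    """

    def __init__(self, d_model=2048, n_heads=32, d_audio=1024, fuse_audio=True):
        super().__init__()
        self.fuse_audio = fuse_audio
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_proj = nn.Linear(d_audio, d_model)  # map encoder dim to LLM dim
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, audio=None, attn_mask=None):
        # Standard self-attention over text tokens (a causal mask would be
        # supplied via attn_mask during training/inference).
        h = self.norm1(x)
        sa, _ = self.self_attn(h, h, h, attn_mask=attn_mask)
        x = x + sa

        # Audio enters only through cross-attention; note that it is NOT
        # folded into the FFN input below.
        audio_ctx = 0.0
        if self.fuse_audio and audio is not None:
            a = self.audio_proj(audio)
            audio_ctx, _ = self.cross_attn(self.norm2(x), a, a)

        # The FFN sees only the text-side residual stream.
        ffn_out = self.ffn(self.norm3(x))
        return x + ffn_out + audio_ctx


class DelayedFusionStack(nn.Module):
    """Stack whose first `fusion_start` layers are text-only (finding 1)."""

    def __init__(self, n_layers=16, fusion_start=4, **kw):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionOnlyAudioLayer(fuse_audio=(i >= fusion_start), **kw)
            for i in range(n_layers)
        )

    def forward(self, x, audio, attn_mask=None):
        for layer in self.layers:
            x = layer(x, audio=audio, attn_mask=attn_mask)
        return x
```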
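
Finding 3 can be sketched in the same hedged spirit. The class `MultiEncoderAudioBank`, the encoder dimensions, and fusion by concatenation along the key/value axis are illustrative choices rather than the exact PAL mechanism; the point is that the ensemble output is consumed only as attention keys/values, so the LLM's input token count does not grow with the number of encoders.

```python
# Minimal sketch (assumed, not the released code) of finding 3: an ensemble of
# audio encoders feeds the cross-attention as keys/values, leaving the LLM's
# input token count unchanged. Encoder dimensions are placeholders.
import torch
import torch.nn as nn


class MultiEncoderAudioBank(nn.Module):
    """Fuse several audio encoders into one key/value bank for cross-attention."""

    def __init__(self, encoder_dims, d_model=2048):
        super().__init__()
        # One lightweight projection per encoder (e.g. self-supervised,
        # language-aligned, and domain-specific encoders).
        self.projs = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)

    def forward(self, encoder_outputs):
        # encoder_outputs: list of (batch, time_i, dim_i) tensors, one per encoder.
        banks = [proj(feats) for proj, feats in zip(self.projs, encoder_outputs)]
        # Concatenating along the time axis only enlarges the attention
        # key/value bank; the text token sequence fed to the LLM is untouched.
        return torch.cat(banks, dim=1)


if __name__ == "__main__":
    bank = MultiEncoderAudioBank(encoder_dims=[768, 1024, 512])
    outs = [torch.randn(2, 50, 768), torch.randn(2, 25, 1024), torch.randn(2, 100, 512)]
    kv = bank(outs)  # (2, 175, 2048) key/value bank
    print(kv.shape)
```

In this sketch, the returned bank would play the role of the `audio` argument in the delayed-fusion stack above.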

Sample Model Outputs

We compare PAL against the state-of-the-art audio-LLMs LTU and GAMA. These models are trained on a comparable amount of data but use larger LLMs (7B vs. PAL's 1B) and more training epochs (6 vs. PAL's 3). Demo samples are taken from the AVE and AVS datasets.

Sample 1: Goat Bleating

Question: Analyze this audio clip. Describe the sounds you hear in as much detail as possible. What does the overall soundscape suggest about the environment or the activity taking place?

PAL (Llama3.2-1B)(Ours): The audio clip features a single sound event - a goat bleating. The sound is high pitched and nasal, which suggests that the goat is likely young or small. The sound is also relatively short, which suggest that the goat is not in distress or danger. The overall soundscape suggests that the environment is relatively quiet and peaceful, with no other significant sounds or disturbances.
GAMA (Llama2-7B): The sounds of the goat and the sheep suggest that they are in a rural or agricultural environment. The sounds of the machinery suggest that there is some kind of activity or work being done in the area. The overall soundscape suggests a busy and active environment with multiple sources of noise and activity.
LTU (Vicuna1.5-7B): The overall soundscape suggests that there are sheep in the vicinity, and that they are likely grazing or being tended to by a shepherd. The bleating of the sheep is a common sound in rural environments, and the high-pitched and nasal quality of the sound suggests that the sheep are in distress or seeking attention.

Training Details

Language Model: We used Llama 3.2-1B for the experiments reported in the paper.

Training Data: To ensure fair and rigorous comparisons, all experiments use the publicly available OpenAQA-5M dataset introduced by LTU, which comprises approximately 5.6 million audio-text pairs.

Training Curriculum Overview

Training Curriculum Diagram

Acknowledgements

This research was supported by the EPSRC-BBC Prosperity Partnership ‘AI4ME: Future Personalised Object-Based Media Experiences Delivered at Scale Anywhere’ (EP/V038087/1).