O((Na+Nt)^2). LAL reduces this to O((Na+Nt)·Nt), removing the quadratic Na^2 term; FFN compute and memory for audio are eliminated because audio tokens bypass the FFN entirely.
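As a rough illustration of the gap (the token counts below are hypothetical, not taken from the paper), the following sketch compares the number of attention scores computed per layer under the two schemes:

```python
# Rough per-layer attention cost comparison (illustrative token counts only).
# PLITS: audio + text tokens all act as queries and keys -> (Na+Nt)^2 scores.
# LAL:   only text tokens act as queries; audio contributes keys/values only
#        -> (Na+Nt)*Nt scores.

Na = 1500   # hypothetical number of audio tokens (e.g., a long clip)
Nt = 100    # hypothetical number of text tokens

plits_scores = (Na + Nt) ** 2   # quadratic in Na
lal_scores = (Na + Nt) * Nt     # linear in Na

print(f"PLITS scores/layer: {plits_scores:,}")                   # 2,560,000
print(f"LAL scores/layer:   {lal_scores:,}")                     # 160,000
print(f"reduction factor:   {plits_scores / lal_scores:.1f}x")   # 16.0x
```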
| Aspect | PLITS | LAL |
|---|---|---|
| How audio is added | Prepended as tokens to text; processed by every layer | Injected as K/V only; text-only queries |
| Attention complexity | O((Na+Nt)^2) | O((Na+Nt)·Nt) |
| Audio through FFN | Yes (all layers) | No (bypassed) |
| Memory & FLOPs | Higher; grows quadratically with audio length (Na) | Lower; attention scales linearly in Na and audio skips the FFNs |
| Knowledge usage | Context + parametric (both via audio+text through blocks) | Context (attention) → activates parametric (FFN on text) |
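A minimal sketch of the LAL idea, under stated assumptions rather than the authors' implementation: audio features enter a single attention layer as extra keys/values while queries come only from text, and only text states continue into the FFN. The tensor names, dimensions, and use of `nn.MultiheadAttention` are illustrative choices.

```python
import torch
import torch.nn as nn

class LALAttentionSketch(nn.Module):
    """Illustrative layer: audio enters as extra K/V; queries are text-only."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, text_hidden: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # Keys/values cover audio + text; queries cover text only, so the
        # score matrix is Nt x (Na+Nt) instead of (Na+Nt) x (Na+Nt).
        kv = torch.cat([audio_feats, text_hidden], dim=1)
        attn_out, _ = self.attn(query=text_hidden, key=kv, value=kv)
        text_hidden = text_hidden + attn_out
        # Only text states pass through the FFN; audio bypasses it.
        return text_hidden + self.ffn(text_hidden)

# Hypothetical shapes: batch=2, Na=1500 audio frames, Nt=100 text tokens, d=512
layer = LALAttentionSketch(d_model=512, n_heads=8)
out = layer(torch.randn(2, 100, 512), torch.randn(2, 1500, 512))
print(out.shape)  # torch.Size([2, 100, 512])
```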
PAL selects the best path per encoder: PLITS for Whisper (speech benefits from in-LLM decoding) and LAL for general audio encoders (e.g., SSLAM, CLAP). This keeps quality on speech, music, and general audio while substantially improving efficiency compared with a pure PLITS system.
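A toy sketch of this per-encoder routing, with a hypothetical `route_audio` helper and lookup table; the actual PAL selection logic may be configured differently:

```python
from enum import Enum

class FusionPath(Enum):
    PLITS = "prepend_as_tokens"   # audio tokens go through every LLM layer
    LAL = "inject_as_kv"          # audio enters attention as K/V only

# Hypothetical routing table following the description above:
# speech encoder -> PLITS, general-audio encoders -> LAL.
ENCODER_ROUTES = {
    "whisper": FusionPath.PLITS,
    "sslam": FusionPath.LAL,
    "clap": FusionPath.LAL,
}

def route_audio(encoder_name: str) -> FusionPath:
    """Pick the fusion path for a given audio encoder (default: LAL)."""
    return ENCODER_ROUTES.get(encoder_name.lower(), FusionPath.LAL)

print(route_audio("Whisper"))  # FusionPath.PLITS
print(route_audio("CLAP"))     # FusionPath.LAL
```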
This research was supported by the EPSRC and BBC Prosperity Partnership “AI4ME: Future Personalized Object Based Media Experiences Delivered at Scale Anywhere” (EP/V038087/1). Part of the experiments used resources provided by the EuroHPC Joint Undertaking, with access to the LEONARDO EuroHPC supercomputer hosted by CINECA, and consortium resources.