Explainable

Bringing Interpretability to Neural Audio Codecs

Interspeech 2025

Samir Sadok¹,*, Julien Hauret²,³,*, Éric Bavu²

¹ Inria, Université Grenoble Alpes, CNRS, LJK, France
² LMSSC, Conservatoire national des arts et métiers (Cnam), Paris, France
³ APC, French-German Research Institute of Saint-Louis, France

* Equal contribution

LMSSC Inria ISL

Codecs overview

Neural Audio Codecs Timeline

A diverse subset

DAC

Kumar, R. et al. "High-fidelity audio compression with improved RVQGAN." *NeurIPS*, 2023.

SpeechTokenizer

Zhang, X. et al. "Unified speech tokenizer for speech large language models." *ICLR*, 2024.

BigCodec

Xin, D. et al. "Pushing the limits of low-bitrate neural speech codec." *arXiv:2409.05377*, 2024.

Mimi

Défossez, A. et al. "Moshi: a speech-text foundation model for real-time dialogue." *arXiv:2410.00037*, 2024.

Specifications

Codec	Sampling Rate (kHz)	Token Rate (Hz)	Codebook Cardinality	Number of RVQ Scales
DAC	16, 24 or 44.1	86	1024	9
SpeechTokenizer	16	50	1024	8
BigCodec	16	80	8192	1
Mimi	24	12.5	2048	1 + 31
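The nominal bitrate implied by each row can be estimated as token rate × number of RVQ scales × log2(codebook cardinality). The sketch below is only an illustration: it assumes every RVQ scale listed in the table is transmitted at every frame and that all codebooks of a codec share the same cardinality. The function name and the dictionary are hypothetical and do not come from the codecs' own code.

```python
import math

def nominal_bitrate_bps(token_rate_hz, codebook_size, n_codebooks):
    """Bits per second if every codebook index is sent at every frame."""
    bits_per_token = math.log2(codebook_size)
    return token_rate_hz * n_codebooks * bits_per_token

# Values copied from the specifications table (token rate, cardinality, RVQ scales).
codecs = {
    "DAC": (86, 1024, 9),
    "SpeechTokenizer": (50, 1024, 8),
    "BigCodec": (80, 8192, 1),
    "Mimi": (12.5, 2048, 1 + 31),
}

bitrates = {name: nominal_bitrate_bps(*spec) for name, spec in codecs.items()}
for name, bps in bitrates.items():
    print(f"{name}: {bps / 1000:.2f} kbps")
```

Using fewer RVQ scales at inference time lowers the bitrate proportionally, so these figures are upper values for the configurations listed.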

Analysis

Where are speech attributes encoded in neural audio codecs?

Speech attributes

Content

Dendrogram of the discrete HuBERT speech units, from van Niekerk, B. et al. "Rhythm modeling for voice conversion." IEEE Signal Processing Letters, 2023.

Is there a deterministic mapping between HuBERT tokens and codec tokens?

Associations between HuBERT tokens and codec tokens on LibriSpeech test-clean
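One way to quantify how deterministic such an association is: count how often each (HuBERT unit, codec token) pair occurs on the same frame, then compute a conditional entropy from the counts. This is a minimal sketch, not the authors' procedure. It assumes the two token sequences have already been aligned to a common frame rate, and the function names and toy data are hypothetical.

```python
import numpy as np

def cooccurrence(hubert_ids, codec_ids, n_hubert, n_codec):
    """Count how often each (HuBERT unit, codec token) pair shares a frame."""
    counts = np.zeros((n_hubert, n_codec), dtype=np.int64)
    np.add.at(counts, (hubert_ids, codec_ids), 1)
    return counts

def conditional_entropy_bits(counts):
    """H(row variable | column variable) in bits.

    A value of 0 means each observed column token maps to a single row unit.
    """
    joint = counts / counts.sum()
    col_marginal = joint.sum(axis=0, keepdims=True)
    # p(row | col), left at 0 for columns that were never observed.
    cond = np.divide(joint, col_marginal,
                     out=np.zeros_like(joint), where=col_marginal > 0)
    nonzero = joint > 0
    return float(-(joint[nonzero] * np.log2(cond[nonzero])).sum())

# Toy aligned sequences (hypothetical data, one id per frame).
hubert_ids = np.array([0, 0, 1, 1, 2, 2])
codec_ids = np.array([3, 3, 1, 1, 0, 2])

counts = cooccurrence(hubert_ids, codec_ids, n_hubert=3, n_codec=4)
h_hubert_given_codec = conditional_entropy_bits(counts)
h_codec_given_hubert = conditional_entropy_bits(counts.T)
```

In the toy data each codec token always co-occurs with the same HuBERT unit, so the first entropy is zero, while HuBERT unit 2 is split over two codec tokens, so the second is positive.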

t-SNE visualisation of content

Color is attributed via codec-to-HuBERT and HuBERT-to-sound mappings.

t-SNE visualisation of identity

Each dot corresponds to one utterance averaged over time, each color to one speaker.

t-SNE visualisation of pitch

Each dot corresponds to a token associated at least once with a specific (vowel) HuBERT token.

Mutual Information (MI) between tokens and speech attributes

MI estimated with the Contrastive Log-ratio Upper Bound (CLUB) between SpeechTokenizer tokens and speech attributes on LibriSpeech test-clean
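For reference, the CLUB estimator (Cheng et al., ICML 2020) upper-bounds the mutual information between a token representation x and a speech attribute y. Since the true conditional p(y|x) is unknown, it is replaced in practice by a learned variational approximation q_θ(y|x). How q_θ is parameterized for this analysis is not stated here, so the symbols below are generic.

```latex
% CLUB upper bound on mutual information
I_{\mathrm{CLUB}}(x; y)
  = \mathbb{E}_{p(x,y)}\big[\log p(y \mid x)\big]
  - \mathbb{E}_{p(x)}\,\mathbb{E}_{p(y)}\big[\log p(y \mid x)\big]

% Sample-based estimate over N pairs (x_i, y_i) with the variational q_\theta
\hat{I}_{\mathrm{vCLUB}}
  = \frac{1}{N}\sum_{i=1}^{N} \log q_\theta(y_i \mid x_i)
  - \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} \log q_\theta(y_j \mid x_i)
```

The first term scores matched (token, attribute) pairs and the second scores all mismatched pairings, so the estimate grows when tokens make their own attribute much more likely than attributes drawn from other frames.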

Synthesis

How to analyze and control audio from codec tokens with AnCoGen?

AnCoGen-Melspectrogram: principle

AnCoGen-Melspectrogram: architecture

AnCoGen-Codec: architecture

Sadok S., Leglaive S., Girin L., Richard G. & Alameda-Pineda X. AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder. ICASSP 2025, Hyderabad, India
Sadok* S., Hauret* J. & Bavu É. (*: equal contributions) Bringing interpretability to Neural Audio Codecs. Submitted to Interspeech 2025, Rotterdam, Netherlands

ℹ️ Speech attributes used for training

Pitch: CREPE (https://github.com/marl/crepe)
Content: HuBERT (https://github.com/bshall/hubert)
Identity: ECAPA-TDNN (https://github.com/TaoRuijie/ECAPA-TDNN)
Loudness : sliding RMS value of the signal
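The loudness attribute, a sliding RMS of the waveform, can be sketched as follows. The window and hop lengths are illustrative assumptions, not the values used for training, and the function name is hypothetical.

```python
import numpy as np

def sliding_rms(signal, frame_length=1024, hop_length=256):
    """Root-mean-square value of the signal over overlapping windows."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.ndim != 1 or len(signal) < frame_length:
        raise ValueError("signal must be 1-D and at least frame_length long")
    # All windows of length frame_length, then keep one every hop_length samples.
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_length)
    frames = frames[::hop_length]
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Toy example: a constant signal has an RMS equal to its absolute value.
constant = np.full(4096, 0.5)
loudness = sliding_rms(constant)
```

The hop length sets the frame rate of the loudness track, so it would normally be chosen to match the frame rate of the other attributes.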

Resynthesis results

Task:
- Predict speech attributes and reconstruct high-quality audio from the attributes alone
- Compare with the audio reconstructed by the codec alone
Metrics:
- NORESQA-MOS (↑), DNSMOS-BAK (↑), STOI (↑), SpeechBERTScore (↑)
- Test dataset: LibriSpeech

AnCoGen-Melspectrogram

AnCoGen-BigCodec

AnCoGen-SpeechTokenizer

Original

Speaker Identity Switch: principle

Speaker Identity Switch Results

Task:
- Modify speaker identity while preserving linguistic content (Speaker Identity Switch)
Metrics:
- NORESQA-MOS (↑, max 5), cosine similarity (↑, max 1) using Resemblyzer embeddings
- Test dataset: LibriSpeech
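The cosine similarity metric compares the speaker embedding of the converted utterance with that of the target speaker. A minimal sketch, assuming the embeddings have already been extracted: the random vectors below are placeholders standing in for Resemblyzer outputs, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (assumption: real ones would come from a speaker encoder).
rng = np.random.default_rng(0)
target = rng.normal(size=256)
converted = target + 0.1 * rng.normal(size=256)  # close to the target speaker

score = cosine_similarity(converted, target)
print(f"speaker similarity: {score:.3f}")
```

A score close to 1 indicates that the converted utterance sits near the target speaker in embedding space.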