Bringing Interpretability to Neural Audio Codecs

Interspeech 2025

Samir Sadok1,*, Julien Hauret2,3,*, Éric Bavu2

1 Inria, Université Grenoble Alpes, CNRS, LJK, France
2 LMSSC, Conservatoire national des arts et métiers (Cnam), Paris, France
3 APC, French-German Research Institute of Saint-Louis, France

* Equal contribution


LMSSC    Inria    ISL

Context

Codecs overview

Neural Audio Codecs Timeline

A diverse subset

DAC
Kumar, R. et al. "High-fidelity audio compression with improved RVQGAN." NeurIPS, 2023.
SpeechTokenizer
Zhang, X. et al. "Unified speech tokenizer for speech large language models." ICLR, 2024.
BigCodec
Xin, D. et al. "Pushing the limits of low-bitrate neural speech codec." arXiv:2409.05377, 2024.
Mimi
Défossez, A. et al. "Moshi: a speech-text foundation model for real-time dialogue." arXiv:2410.00037, 2024.
Specifications

Codec           | Sampling rate (kHz) | Token rate (Hz) | Codebook cardinality | Number of RVQ scales
DAC             | 16, 24 or 44.1      | 86              | 1024                 | 9
SpeechTokenizer | 16                  | 50              | 1024                 | 8
BigCodec        | 16                  | 80              | 8192                 | 1
Mimi            | 24                  | 12.5            | 2048                 | 1 + 31
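The specifications above imply a nominal bitrate for each codec. The sketch below works it out under one assumption that the slides do not state: every RVQ scale emits one index per frame, and each index costs log2(cardinality) bits. The dictionary layout and function name are illustrative only.

```python
import math

# Values copied from the specifications table; "scales" is the number of RVQ codebooks.
CODECS = {
    "DAC": {"token_rate_hz": 86, "cardinality": 1024, "scales": 9},
    "SpeechTokenizer": {"token_rate_hz": 50, "cardinality": 1024, "scales": 8},
    "BigCodec": {"token_rate_hz": 80, "cardinality": 8192, "scales": 1},
    "Mimi": {"token_rate_hz": 12.5, "cardinality": 2048, "scales": 32},  # 1 + 31
}


def nominal_bitrate_bps(token_rate_hz, cardinality, scales):
    """Bits per second, assuming every RVQ scale emits one index per frame."""
    bits_per_index = math.log2(cardinality)
    return token_rate_hz * scales * bits_per_index


BITRATES = {name: nominal_bitrate_bps(**spec) for name, spec in CODECS.items()}

for name, bps in BITRATES.items():
    print(f"{name}: {bps / 1000:.2f} kbps")
```

Under this assumption the table gives 7.74 kbps for DAC, 4.00 kbps for SpeechTokenizer, 1.04 kbps for BigCodec and 4.40 kbps for Mimi. These are upper figures: a codec decoded with fewer scales runs at a proportionally lower bitrate.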

Analysis

Where are speech attributes encoded in neural audio codecs?

Speech attributes

Content

Dendrogram of the discrete HuBERT speech units, from van Niekerk, B. et al. "Rhythm modeling for voice conversion." IEEE Signal Processing Letters, 2023.

Is there a deterministic mapping between HuBERT tokens and codec tokens?

HuBERT and codec token associations on LibriSpeech test-clean
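One way to quantify such associations is a co-occurrence count between frame-aligned token sequences. The sketch below is a minimal illustration, not the authors' procedure: it assumes both sequences have already been brought to the same frame rate, and the function names and toy values are made up.

```python
import numpy as np


def cooccurrence(codec_tokens, hubert_tokens, n_codec, n_hubert):
    """Count how often each codec token is aligned with each HuBERT unit."""
    counts = np.zeros((n_codec, n_hubert), dtype=np.int64)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(counts, (codec_tokens, hubert_tokens), 1)
    return counts


def purity(counts):
    """Per codec token: share of occurrences landing on its most frequent HuBERT unit."""
    totals = counts.sum(axis=1)
    best = counts.max(axis=1)
    return np.divide(best, totals, out=np.zeros(len(totals)), where=totals > 0)


# Toy frame-aligned sequences (made-up values, not LibriSpeech data).
codec = np.array([0, 0, 1, 1, 2, 2, 2, 0])
hubert = np.array([3, 3, 1, 0, 2, 2, 2, 3])

counts = cooccurrence(codec, hubert, n_codec=3, n_hubert=4)
scores = purity(counts)
print(counts)
print(scores)
```

A purity of 1.0 for every codec token would indicate a deterministic codec-to-HuBERT mapping; lower values mean a codec token is shared across several HuBERT units.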

t-SNE visualisation of content


Color is assigned via codec-to-HuBERT and HuBERT-to-sound mappings.

t-SNE visualisation of identity


Each dot corresponds to one utterance averaged over time, each color to one speaker.
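The time averaging behind each dot can be sketched as follows. The array shapes, the embedding dimension and the random inputs are assumptions for illustration; the slides do not specify which codec representation is averaged.

```python
import numpy as np


def utterance_embedding(frames):
    """Average a (time, dim) array of frame-level codec embeddings over time."""
    frames = np.asarray(frames, dtype=np.float64)
    return frames.mean(axis=0)


# Three made-up utterances of different lengths with 8-dimensional frames.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(t, 8)) for t in (50, 120, 75)]

# One row per utterance, regardless of its duration.
pooled = np.stack([utterance_embedding(u) for u in utterances])
print(pooled.shape)
```

The resulting matrix has one fixed-size vector per utterance and could then be passed to any t-SNE implementation and colored by speaker label.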

t-SNE visualisation of pitch


Each dot corresponds to a token associated at least once with a specific (vowel) HuBERT token.

Mutual Information (MI) between tokens and speech attributes
MI estimated with the Contrastive Log-ratio Upper Bound (CLUB) between SpeechTokenizer tokens and speech attributes on LibriSpeech test-clean
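For reference, the sample-based CLUB estimate is the mean log-likelihood of matched pairs minus the mean log-likelihood over all pairs. The sketch below assumes a fixed Gaussian conditional q(y | x) purely for illustration; in practice q is usually a learned variational network, and the slides do not say how it was parameterized here.

```python
import numpy as np


def club_estimate(log_q):
    """CLUB estimate from log_q[i, j] = log q(y_j | x_i) for N paired samples."""
    log_q = np.asarray(log_q, dtype=np.float64)
    positive = np.mean(np.diag(log_q))  # matched pairs (x_i, y_i)
    negative = np.mean(log_q)           # all pairs (x_i, y_j)
    return positive - negative


def gaussian_log_q(x, y, sigma):
    """Toy conditional q(y | x) = N(y; x, sigma^2); an assumption, not a learned model."""
    diff = y[None, :] - x[:, None]
    return -0.5 * (diff / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))


rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y_dep = x + 0.5 * rng.normal(size=1000)  # y depends on x

mi_dep = club_estimate(gaussian_log_q(x, y_dep, sigma=0.5))

# If q ignores x (every row identical), the two means coincide and the estimate is 0.
row = -0.5 * y_dep**2
mi_ignore_x = club_estimate(np.tile(row, (len(row), 1)))
print(mi_dep, mi_ignore_x)
```

The estimate is an upper bound on MI only when q matches the true conditional closely, so values should be compared across attributes rather than read as exact information in nats.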

Synthesis

How to analyze and control audio from codec tokens with AnCoGen?

AnCoGen-Melspectrogram: principle
AnCoGen-Melspectrogram: architecture
AnCoGen-Codec: architecture

Sadok S., Leglaive S., Girin L., Richard G. & Alameda-Pineda X.
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder. ICASSP 2025, Hyderabad, India
Sadok* S., Hauret* J., & Bavu É. (*: equal contributions)
Bringing interpretability to Neural Audio Codecs. Submitted to Interspeech 2025, Rotterdam, Netherlands
ℹ️ Speech attributes used for training

Resynthesis results

  • Task:
    • Predict speech attributes and reconstruct high-quality audio from attributes alone
    • Comparison with the audio from the codec alone
  • Metrics:
    • NORESQA-MOS (↑), DNSMOS-BAK (↑), STOI (↑), SpeechBERTScore (↑)
    • Test dataset: LibriSpeech
AnCoGen-Melspectrogram
AnCoGen-BigCodec
AnCoGen-SpeechTokenizer
Original
Speaker Identity switch: principle

Speaker Identity Switch Results

  • Task:
    • Modify speaker identity while preserving linguistic content (Speaker Identity Switch)
  • Metrics:
    • NORESQA-MOS (↑, ≤ 5), Cosine Similarity (↑, ≤ 1) using Resemblyzer embeddings
    • Test dataset: LibriSpeech
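The cosine similarity metric can be sketched as below. The vectors are placeholders standing in for speaker embeddings; extracting real embeddings with Resemblyzer is left out, and the variable names are illustrative.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder embeddings (made-up values, not real speaker embeddings).
target = np.array([0.6, 0.8, 0.0])
converted = np.array([0.6, 0.8, 0.0])
unrelated = np.array([0.0, 0.0, 1.0])

sim_match = cosine_similarity(converted, target)
sim_mismatch = cosine_similarity(unrelated, target)
print(sim_match, sim_mismatch)
```

For the identity switch task, the similarity would be computed between the embedding of the converted utterance and that of the target speaker, with values closer to 1 indicating a closer match.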
AnCoGen-Melspectrogram
AnCoGen-BigCodec
AnCoGen-SpeechTokenizer
Source
Target


Thank you for your attention




samir.sadok@inria.fr · julien.hauret@lecnam.net · eric.bavu@lecnam.net
