EBEN: Extreme bandwidth extension network applied to speech signals captured with noise-resilient body-conduction microphones


Scientific Day 2023 – Computer Science X Circuits and Systems
ISEP campus - 15/06/2023

Julien HAURET, Thomas JOUBAUD, Véronique ZIMPFER, Éric BAVU
☆ : LMSSC, Conservatoire national des arts et métiers, Paris, France, HESAM Université
† : Department of Acoustics and Soldier Protection, French-German Research Institute of Saint-Louis (ISL)

       

Context

Context

BANG Prototype

New generation ear plug
Quiet conditions

Reference

In-ear


85dB noise

Reference

In-ear

Spectrograms

Degradation

Pseudo Quadrature Mirror Filters

Pseudo Quadrature Mirror Filter (PQMF) banks
Analysis and synthesis

In practice

EBEN : Extreme Bandwidth Extension Network

EBEN Generator
EBEN overview

Loss functions

\[ \mathcal{L_D}= \underbrace{E_y\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t max(0,1-D_{k,t}(y))\right]}_\textrm{real adversarial} + \underbrace{E_x\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t max(0,1+D_{k,t}(G(x)))\right]}_\textrm{fake adversarial} \] \[\mathcal{L_G}= \underbrace{E_x\left[ \frac{1}{K} \displaystyle \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t max(0,1-D_{k,t}(G(x)))\right]}_\textrm{adversarial} + \underbrace{ E_x\left[ \frac{1}{K} \displaystyle \sum_{\substack{k \in [0,3] \\ l \in [1,L_k [ }} \frac{1}{T_{k,l}F_{k,l}} \displaystyle \sum_t \| D_{k,t}^{(l)}(y)-D_{k,t}^{(l)}(G(x))\| _{L_1} \right]}_\textrm{feature matching} \]

Results

Qualitative results: EBEN

Objective metrics

Speech PESQ SI-SDR STOI
Simulated In-ear 2.42 (0.34) 8.4 (3.7) 0.83 (0.05)
Audio U-net 2.24 (0.49) 11.9 (3.7) 0.87 (0.04)
Hifi-GAN v3 1.32 (0.16) -25.1 (11.4) 0.78 (0.04)
Seanet 1.92 (0.48) 11.1 (3.0) 0.89 (0.04)
Streaming Seanet 2.01 (0.46) 11.2 (3.6) 0.89 (0.04)
EBEN (ours) 2.08 (0.45) 10.9 (3.3) 0.89 (0.04)

MUSHRA

Frugality indicators

Speech \[P_{gen}\] \[P_{dis}\] \[\tau~\textrm{(ms)}\] \[\delta~\textrm{(MB)}\]
Audio U-net 71.0 M \[\emptyset\] 37.5 1117.3
Hifi-GAN v3 1.5 M 70.7 M 3.1 22.2
Seanet 8.3 M 56.6 M 13.1 89.2
Streaming Seanet 0.7 M 56.6 M 7.5 10.9
EBEN (ours) 1.9 M 27.8 M 4.3 20

Dataset

Dataset

Thank you for your attention


julien.hauret@lecnam.net | https://jhauret.github.io/eben/

Appendices

Signal processing equations

Source-Filter model \[y(t) = (h*x)(t) \] \[Y(f) = H(f).X(f) \]
Transfer function estimation \[H(f)=\frac{P_{xy}(f)}{P_{xx}(f)} \] \[C_{xy}(f)=\frac{|P_{xy}(f)|^2}{P_{xx}(f) . P_{yy}(f)} \]

References - 1

PQMF

  • Joseph Rothweiler: Polyphase quadrature filters–a new subband coding technique. In ICASSP 1983.
  • Truong Q Nguyen: Near-perfect-reconstruction pseudo-qmf banks. In 1994 IEEE.
  • Yuan-Pei Lin & al: A kaiser window approach for the design of prototype filters of cosine modulated filterbanks. In 1998 IEEE.

References - 2

Deep learning

  • Marco Tagliasacchi & al: Seanet : A multi-modal speech enhancement network. arXiv preprint, 2020.
  • Volodymyr Kuleshov, S Zayd Enam, and Stefano Ermon, “Audio super-resolution using neural nets,” in ICLR (WorkshopTrack), 2017
  • Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
  • Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, and Dominik Roblek, “Real-time speech frequency bandwidth extension,” in ICASSP 2021-2021 IEEE nternational Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 691–695.
  • Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron C Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, 2019.
  • Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29.

References - 3

Évaluation

  • Antony W Rix & al : Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE.
  • Cees H Taal & al: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE.
  • Jonathan Le Roux & al: Sdr–half-baked or well done ? In ICASSP 2019.
  • Yi Luo & al : time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE.
  • B Series, “Method for the subjective assessment of intermediate quality level of audio systems,” International Telecommunication Union Radiocommunication Assembly, 2014.
  • Dan Barry, Qijian Zhang, Pheobe Wenyi Sun, and Andrew Hines, “Go listen: an end-to-end online listening test platform,” Journal of Open Research Software, vol. 9, no. 1, 2021.