eben

EBEN: Extreme bandwidth extension network applied to speech signals captured with noise-resilient body-conduction microphones

Scientific Day 2023 – Computer Science X Circuits and Systems
ISEP campus - 15/06/2023

Julien HAURET^☆, Thomas JOUBAUD^†, Véronique ZIMPFER^†, Éric BAVU^☆
☆ : LMSSC, Conservatoire national des arts et métiers, Paris, France, HESAM Université
† : Department of Acoustics and Soldier Protection, French-German Research Institute of Saint-Louis (ISL)

Context

BANG Prototype

Quiet conditions

Reference

In-ear

85dB noise

Reference

In-ear

Spectrograms

Degradation

Pseudo Quadrature Mirror Filters

Pseudo Quadrature Mirror Filter (PQMF) banks

In practice

EBEN : Extreme Bandwidth Extension Network

Loss functions

\[ \mathcal{L_D}= \underbrace{E_y\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t max(0,1-D_{k,t}(y))\right]}_\textrm{real adversarial} + \underbrace{E_x\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t max(0,1+D_{k,t}(G(x)))\right]}_\textrm{fake adversarial} \] \[\mathcal{L_G}= \underbrace{E_x\left[ \frac{1}{K} \displaystyle \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t max(0,1-D_{k,t}(G(x)))\right]}_\textrm{adversarial} + \underbrace{ E_x\left[ \frac{1}{K} \displaystyle \sum_{\substack{k \in [0,3] \\ l \in [1,L_k [ }} \frac{1}{T_{k,l}F_{k,l}} \displaystyle \sum_t \| D_{k,t}^{(l)}(y)-D_{k,t}^{(l)}(G(x))\| _{L_1} \right]}_\textrm{feature matching} \]

Results

Qualitative results: EBEN

Objective metrics

Speech	PESQ	SI-SDR	STOI
Simulated In-ear	2.42 (0.34)	8.4 (3.7)	0.83 (0.05)
Audio U-net	2.24 (0.49)	11.9 (3.7)	0.87 (0.04)
Hifi-GAN v3	1.32 (0.16)	-25.1 (11.4)	0.78 (0.04)
Seanet	1.92 (0.48)	11.1 (3.0)	0.89 (0.04)
Streaming Seanet	2.01 (0.46)	11.2 (3.6)	0.89 (0.04)
EBEN (ours)	2.08 (0.45)	10.9 (3.3)	0.89 (0.04)

MUSHRA

Frugality indicators

Speech	\[P_{gen}\]	\[P_{dis}\]	\[\tau~\textrm{(ms)}\]	\[\delta~\textrm{(MB)}\]
Audio U-net	71.0 M	\[\emptyset\]	37.5	1117.3
Hifi-GAN v3	1.5 M	70.7 M	3.1	22.2
Seanet	8.3 M	56.6 M	13.1	89.2
Streaming Seanet	0.7 M	56.6 M	7.5	10.9
EBEN (ours)	1.9 M	27.8 M	4.3	20

Dataset

Thank you for your attention

julien.hauret@lecnam.net | https://jhauret.github.io/eben/

Signal processing equations

Source-Filter model \[y(t) = (h*x)(t) \] \[Y(f) = H(f).X(f) \]

Transfer function estimation \[H(f)=\frac{P_{xy}(f)}{P_{xx}(f)} \] \[C_{xy}(f)=\frac{|P_{xy}(f)|^2}{P_{xx}(f) . P_{yy}(f)} \]

References - 1

PQMF

Joseph Rothweiler: Polyphase quadrature filters–a new subband coding technique. In ICASSP 1983.
Truong Q Nguyen: Near-perfect-reconstruction pseudo-qmf banks. In 1994 IEEE.
Yuan-Pei Lin & al: A kaiser window approach for the design of prototype filters of cosine modulated filterbanks. In 1998 IEEE.

References - 2

Deep learning

Marco Tagliasacchi & al: Seanet : A multi-modal speech enhancement network. arXiv preprint, 2020.
Volodymyr Kuleshov, S Zayd Enam, and Stefano Ermon, “Audio super-resolution using neural nets,” in ICLR (WorkshopTrack), 2017
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, and Dominik Roblek, “Real-time speech frequency bandwidth extension,” in ICASSP 2021-2021 IEEE nternational Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 691–695.
Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron C Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, 2019.
Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29.

References - 3

Évaluation

Antony W Rix & al : Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE.
Cees H Taal & al: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE.
Jonathan Le Roux & al: Sdr–half-baked or well done ? In ICASSP 2019.
Yi Luo & al : time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE.
B Series, “Method for the subjective assessment of intermediate quality level of audio systems,” International Telecommunication Union Radiocommunication Assembly, 2014.
Dan Barry, Qijian Zhang, Pheobe Wenyi Sun, and Andrew Hines, “Go listen: an end-to-end online listening test platform,” Journal of Open Research Software, vol. 9, no. 1, 2021.

EBEN: Extreme bandwidth extension network applied to speech signals captured with noise-resilient body-conduction microphones

Context

Context

BANG Prototype

Quiet conditions

85dB noise

Spectrograms

Degradation

Pseudo Quadrature Mirror Filters

In practice

EBEN : Extreme Bandwidth Extension Network

Loss functions

Results

Qualitative results: EBEN

Objective metrics

MUSHRA

Frugality indicators

Dataset

Dataset

Thank you for your attention

Appendices

Signal processing equations

References - 1

PQMF

References - 2

Deep learning

References - 3

Évaluation