PhD defense — Julien Hauret
September 12, 2025 - Cnam Paris - Laussédat Amphitheater
Supervisor: Éric Bavu - Co-supervisor: Thomas Joubaud
\[ \mathcal{L}_\mathcal{D}= \mathbb{E}_y\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t \max(0,1-D_{k,t}(y))\right] + \mathbb{E}_x\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t \max(0,1+D_{k,t}(G(x)))\right] \] (1) Discriminator loss
\[ \mathcal{L}_\mathcal{G}^{adv}= \mathbb{E}_x\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t \max(0,1-D_{k,t}(G(x)))\right] \] (2) Generator adversarial loss
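As a minimal sketch of Eqs. (1) and (2), the hinge terms can be written with NumPy. It assumes each of the K discriminators returns a 1-D array of time-wise outputs D_{k,t} for a single sample, so the expectation over the dataset is dropped. The function names and the list-of-arrays layout are illustrative assumptions, not the EBEN implementation.

```python
import numpy as np

def hinge_discriminator_loss(real_outputs, fake_outputs):
    """Eq. (1) for one sample.

    real_outputs, fake_outputs: lists of K 1-D arrays (hypothetical layout),
    holding D_{k,t}(y) and D_{k,t}(G(x)) for each discriminator k.
    """
    # Average over time inside each discriminator, then over the K discriminators.
    real_term = np.mean([np.mean(np.maximum(0.0, 1.0 - d)) for d in real_outputs])
    fake_term = np.mean([np.mean(np.maximum(0.0, 1.0 + d)) for d in fake_outputs])
    return float(real_term + fake_term)

def hinge_generator_loss(fake_outputs):
    """Eq. (2) for one sample: the generator pushes D(G(x)) above +1."""
    return float(np.mean([np.mean(np.maximum(0.0, 1.0 - d)) for d in fake_outputs]))

# Toy outputs for K = 2 discriminators with different time resolutions.
real = [np.array([2.0, 0.5]), np.array([1.0])]
fake = [np.array([-2.0, 0.0]), np.array([-1.0])]
d_loss = hinge_discriminator_loss(real, fake)
g_loss = hinge_generator_loss(fake)
```

Averaging per discriminator before averaging over k mirrors the 1/T and 1/K factors, so a discriminator with more time steps does not dominate the sum.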
\[ \mathcal{L}_\mathcal{G}^{feat}= \mathbb{E}_x \left[ \frac{1}{K} \sum_{\substack{k \in [0,3] \\ l \in [1,L_k [ }} \frac{1}{T_{k,l}F_{k,l}} \sum_t \frac{\left\| D_{k,t}^{(l)}(y)-D_{k,t}^{(l)}(G(x)) \right\|_{L_1}}{ \mathrm{mean}(D_{k,t}^{(l)}(G(x)))}\right] \] (3) Generator feature loss
\[ \mathcal{L}_\mathcal{G}^{spec} = \mathbb{E}_{x, y} \left[ \sum_{\substack{f_r \in \{512, 1024, 2048\} \\ h_r \in \{50, 120, 240\} \\ w_r \in \{240, 600, 1200\}}} \left\| \Psi\left(\mathrm{STFT}_{f_r, h_r, w_r}(y)\right) - \Psi\left(\mathrm{STFT}_{f_r, h_r, w_r}(G(x))\right) \right\|_{L_1} \right] \] (4) Generator spectral loss
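The multi-resolution structure of Eq. (4) can be sketched in plain NumPy. Three points are assumptions rather than facts from the slides: the (f_r, h_r, w_r) values are paired in the order listed, Ψ is taken to be the plain magnitude, and the L1 norm is normalized by the number of time-frequency bins. The helper names are hypothetical.

```python
import numpy as np

# (n_fft, hop, window) triplets, assumed to be paired in the order listed in Eq. (4).
RESOLUTIONS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]

def stft_magnitude(x, n_fft, hop, win):
    """Magnitude STFT with a Hann window; each frame is zero-padded to n_fft."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))

def spectral_loss(y, y_hat):
    """Sum over resolutions of the mean absolute spectral difference.

    Psi is assumed to be the identity on magnitudes (an illustrative choice).
    """
    return float(sum(
        np.mean(np.abs(stft_magnitude(y, *res) - stft_magnitude(y_hat, *res)))
        for res in RESOLUTIONS
    ))

rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)   # stand-in for y
enhanced = 0.5 * reference               # stand-in for G(x)
loss_same = spectral_loss(reference, reference)
loss_diff = spectral_loss(reference, enhanced)
```

Using several window lengths trades time resolution against frequency resolution, so errors that one analysis setting hides can still be penalized by another.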
Speech | eSTOI | Noresqa-MOS |
---|---|---|
Simulated In-ear | 0.83 | 2.57 |
Audio U-net | 0.87 | 2.59 |
Hifi-GAN v3 | 0.78 | 3.70 |
Streaming Seanet | 0.89 | 3.91 |
Seanet | 0.89 | 4.25 |
EBEN (ours) | 0.89 | 4.02 |
Speech | \(P_{gen}\) | \(P_{dis}\) | \(\tau~\textrm{(ms)}\) | \(\delta~\textrm{(MB)}\) |
---|---|---|---|---|
Audio U-net | 71.0 M | \(\emptyset\) | 37.5 | 1117.3 |
Hifi-GAN v3 | 1.5 M | 70.7 M | 3.1 | 22.2 |
Streaming Seanet | 0.7 M | 56.6 M | 7.5 | 10.9 |
Seanet | 8.3 M | 56.6 M | 13.1 | 89.2 |
EBEN (ours) | 1.9 M | 27.8 M | 4.3 | 20 |
Test set | eSTOI | Noresqa-MOS |
---|---|---|
In-domain (simulation) | 0.89 | 4.02 |
Real signals | 0.51 | 3.82 |
Subset | Weighting | Leq [dB] | Lmin [dB] | Lmax [dB] |
---|---|---|---|---|
speech-noisy (3 hours) | Linear | 94.1 | 63.3 | 108.7 |
speech-noisy (3 hours) | A | 90.8 | 57.7 | 106.6 |
speechless-noisy (8 hours) | Linear | 89.6 | 31.9 | 103.2 |
speechless-noisy (8 hours) | A | 88.0 | 17.9 | 104.4 |
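For readers unfamiliar with these quantities, here is a rough sketch of how an unweighted ("Linear") equivalent level can be computed from a calibrated pressure signal in pascals. The 125 ms block averaging used for Lmin and Lmax is an assumption standing in for a proper sound level meter time weighting, A-weighting is omitted, and the function names are hypothetical. This is not the measurement chain used for the table.

```python
import numpy as np

P_REF = 20e-6  # reference pressure in Pa, the usual convention for airborne sound

def level_db(pressure):
    """Level in dB re 20 uPa of the mean square pressure."""
    return 10.0 * np.log10(np.mean(pressure ** 2) / P_REF ** 2)

def leq_lmin_lmax(pressure, fs, block_s=0.125):
    """Return (Leq, Lmin, Lmax); Lmin and Lmax come from short-term block levels.

    block_s is an assumed averaging time, not a standardized time weighting.
    """
    n = int(fs * block_s)
    n_blocks = len(pressure) // n
    blocks = pressure[:n_blocks * n].reshape(n_blocks, n)
    short_term = 10.0 * np.log10(np.mean(blocks ** 2, axis=1) / P_REF ** 2)
    return float(level_db(pressure)), float(short_term.min()), float(short_term.max())

# A 1 kHz tone with 1 Pa RMS, the level of a common acoustic calibrator (about 94 dB).
fs = 48000
t = np.arange(fs) / fs
tone = np.sqrt(2.0) * np.sin(2.0 * np.pi * 1000.0 * t)
leq, lmin, lmax = leq_lmin_lmax(tone, fs)
```

For a stationary tone the three values coincide; for the noisy subsets above, the gap between Lmin and Lmax reflects how much the level fluctuates over time.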
speech-clean
Sensor | Configuration | eSTOI | Noresqa-MOS |
---|---|---|---|
Forehead | Raw signal | 0.731 | 3.760 |
Forehead | EBEN (M=4, P=4, Q=4) | 0.855 | 4.250 |
Soft In-ear | Raw signal | 0.752 | 3.315 |
Soft In-ear | EBEN (M=4, P=2, Q=4) | 0.868 | 4.331 |
Rigid In-ear | Raw signal | 0.782 | 3.392 |
Rigid In-ear | EBEN (M=4, P=2, Q=4) | 0.877 | 4.285 |
Throat | Raw signal | 0.677 | 3.097 |
Throat | EBEN (M=4, P=2, Q=4) | 0.834 | 3.862 |
Temple | Raw signal | 0.602 | 2.905 |
Temple | EBEN (M=4, P=1, Q=4) | 0.763 | 3.632 |
speech-noisy
Sensor | Initialization | Squim-eSTOI | Noresqa-MOS |
---|---|---|---|
Forehead | Raw signal | 0.901 | 3.85 |
Forehead | Tested from pretrained* | 0.949 | 4.08 |
Forehead | Trained† from scratch | 0.949 | 4.14 |
Forehead | Trained† from pretrained* | 0.971 | 4.20 |
Rigid In-ear | Raw signal | 0.751 | 3.47 |
Rigid In-ear | Tested from pretrained* | 0.812 | 3.73 |
Rigid In-ear | Trained† from scratch | 0.876 | 3.51 |
Rigid In-ear | Trained† from pretrained* | 0.873 | 3.80 |
Throat | Raw signal | 0.942 | 3.71 |
Throat | Tested from pretrained* | 0.969 | 4.05 |
Throat | Trained† from scratch | 0.978 | 3.87 |
Throat | Trained† from pretrained* | 0.971 | 3.98 |
speech-clean
speech-clean + speechless-noisy
speech-noisy samples: Noisy headset, Headset enhanced (Sepformer), Noisy Throat, Throat enhanced (EBEN)
Metric | Pearson Correlation Coefficient |
---|---|
Intelligibility | |
STOI | 0.52 |
ABC-MRT | 0.57 |
1 - PER | 0.45 |
Quality | |
STOI | 0.87 |
PESQ | 0.81 |
N-MOS | 0.76 |
Identity | |
ECAPA2 | 0.90 |
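The Pearson correlation coefficient reported above can be computed as in the sketch below. The table does not state which reference scores each metric is correlated with, so the two input arrays (objective metric values and reference ratings) are placeholders, and the function name is illustrative.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    # Covariance divided by the product of the standard deviations.
    return float(np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2)))

# Toy values only: a hypothetical metric against hypothetical reference ratings.
metric_values = [1.0, 2.0, 3.0]
reference_ratings = [1.0, 3.0, 2.0]
r = pearson(metric_values, reference_ratings)
```

The coefficient only captures linear association, so a metric that is monotonic but strongly nonlinear with respect to the reference ratings would score lower than a rank correlation would suggest.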
speech-clean for the throat microphone
Audio samples (Signal 1 to Signal 4) for each method: Corrupted, Reference, EBEN, Mimi finetuning, NeMo FlowMatching (Ku et al., ICASSP 2025)
Approach | Parameters | eSTOI | Noresqa-MOS | PER | SI-SDR |
---|---|---|---|---|---|
Raw Throat | – | 0.67 | 3.10 | 50.8% | -8.0 |
EBEN | 1.9M | 0.834 | 3.86 | 18.6% | 3.2 |
Mimi finetuning | 96.2M | 0.841 | 4.11 | 15.1% | 1.37 |
NeMo FlowMatching (Ku et al., ICASSP 2025) | 430M | 0.822 | 4.39 | 7.6% | 7.3 |
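The SI-SDR column uses the scale-invariant signal-to-distortion ratio, sketched below with its usual definition: project the estimate onto the reference, then compare the energy of that projection to the energy of the residual. Removing the mean of both signals first is a common convention assumed here. The function name is illustrative and this is not the evaluation code behind the table.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return float(10.0 * np.log10(np.dot(target, target) / np.dot(residual, residual)))

# Toy example: the residual is orthogonal to the reference, with 1/100 of its energy.
ref = np.array([1.0, 0.0, -1.0, 0.0])
est = ref + 0.1 * np.array([0.0, 1.0, 0.0, -1.0])
value = si_sdr(est, ref)          # 20 dB by construction
scaled = si_sdr(3.0 * est, ref)   # unchanged by a gain on the estimate
```

Because the measure is waveform-based, a generative model can sound natural and still score low when its output is not sample-aligned with the reference.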
Measure | Frequency resolution | No linearity assumption | Formula |
---|---|---|---|
Coherence | ✅ | ✅ | \(\displaystyle \gamma^2_{xy}(f) = \frac{|S_{xy}(f)|^2}{S_{xx}(f)S_{yy}(f)}\) |
Transfer function | ✅ | ❌ | \(\displaystyle H(f) = \frac{Y(f)}{X(f)}\) |
Concordance coefficient | ❌ | ✅ | \(\displaystyle \rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2+\sigma_y^2+(\mu_x-\mu_y)^2}\) |
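Two of the measures above can be sketched in NumPy under simple assumptions. The coherence estimate averages spectra over non-overlapping Hann-windowed segments (a simplified Welch-style estimate with an assumed segment length), and the concordance coefficient follows the formula in the table with population variances. Function names and the `nperseg` default are illustrative.

```python
import numpy as np

def coherence(x, y, nperseg=256):
    """Magnitude-squared coherence from non-overlapping Hann-windowed segments."""
    n_seg = len(x) // nperseg
    window = np.hanning(nperseg)
    X = np.fft.rfft(x[:n_seg * nperseg].reshape(n_seg, nperseg) * window, axis=-1)
    Y = np.fft.rfft(y[:n_seg * nperseg].reshape(n_seg, nperseg) * window, axis=-1)
    s_xy = np.mean(np.conj(X) * Y, axis=0)   # cross-spectral density estimate
    s_xx = np.mean(np.abs(X) ** 2, axis=0)   # auto-spectral density of x
    s_yy = np.mean(np.abs(Y) ** 2, axis=0)   # auto-spectral density of y
    return np.abs(s_xy) ** 2 / (s_xx * s_yy)

def concordance(x, y):
    """Concordance coefficient: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # equals rho * sigma_x * sigma_y
    return float(2.0 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2))

rng = np.random.default_rng(0)
a = rng.standard_normal(16384)
b = rng.standard_normal(16384)
coh_scaled = coherence(a, 2.0 * a)   # linearly related signals
coh_indep = coherence(a, b)          # independent noise
ccc_shift = concordance(np.array([0.0, 1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0, 4.0]))
```

A pure gain leaves the coherence at 1 in every frequency bin, while the concordance coefficient drops below 1 as soon as the two signals differ in offset or scale, which shows how the two measures answer different questions.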
Filterbank | Real-valued | Short filter length | No redundancy | No perceptual bias | Easy manipulation |
---|---|---|---|---|---|
PQMF (critically sampled) | ✅ | ✅ | ✅ | ✅ | ✅ |
Paraunitary FIR | ✅ | ❌ | ✅ | ✅ | ❌ |
Oversampled FB | ✅ | ❌ | ❌ | ✅ | ❌ |
Non-uniform FB | ✅ | ❌ | ❌ | ❌ | ❌ |
DFT-modulated | ❌ | ✅ | ✅ | ✅ | ❌ |
Speech | PESQ | SI-SDR | eSTOI | Noresqa-MOS |
---|---|---|---|---|
Simulated In-ear | 2.42 | 8.4 | 0.83 | 2.57 |
Audio U-net | 2.24 | 11.9 | 0.87 | 2.59 |
Hifi-GAN v3 | 1.32 | -25.1 | 0.78 | 3.70 |
Streaming Seanet | 2.01 | 11.2 | 0.89 | 3.91 |
Seanet | 1.92 | 11.1 | 0.89 | 4.25 |
EBEN (ours) | 2.08 | 10.9 | 0.89 | 4.02 |
Figures: noise level for silence (mean), speech-noisy (mean, min, max), and speechless-noisy (mean, min, max)
Audio samples per subset (speech-clean, speech-noisy, speechless-noisy, speechless-clean) and per sensor: Headset, Forehead, Soft in-ear, Rigid in-ear, Throat, Temple
librispeech-test-clean