PhD defense — Julien Hauret
September 12, 2025 - Cnam Paris - Laussédat Amphitheater
Supervisor: Éric Bavu - Co-supervisor: Thomas Joubaud
\[ \mathcal{L}_\mathcal{D}= \mathbb{E}_y\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t \max(0,1-D_{k,t}(y))\right] + \mathbb{E}_x\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t \max(0,1+D_{k,t}(G(x)))\right] \] (1) Discriminator loss
\[ \mathcal{L}_\mathcal{G}^{adv}= \mathbb{E}_x\left[ \frac{1}{K} \sum_{k \in [0,3]} \frac{1}{T_{k,L_k}} \sum_t \max(0,1-D_{k,t}(G(x)))\right] \] (2) Generator adversarial loss
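
As a rough illustration of Eqs. (1) and (2), the hinge terms can be sketched in NumPy. This is a minimal sketch for a single example, not the training code behind the results: the function names and the list-of-arrays layout (one array of final-layer outputs \(D_{k,t}\) per discriminator \(k\)) are assumptions, and the expectation over the dataset is left to the caller.

```python
import numpy as np


def discriminator_hinge_loss(real_outputs, fake_outputs):
    """Hinge loss of Eq. (1) for one example.

    real_outputs / fake_outputs: lists with one 1-D array per discriminator k,
    holding D_{k,t}(y) and D_{k,t}(G(x)) over time (assumed layout).
    """
    n_disc = len(real_outputs)
    # np.mean over time plays the role of (1 / T_{k,L_k}) * sum_t
    real_term = sum(np.mean(np.maximum(0.0, 1.0 - d)) for d in real_outputs)
    fake_term = sum(np.mean(np.maximum(0.0, 1.0 + d)) for d in fake_outputs)
    return (real_term + fake_term) / n_disc


def generator_adversarial_loss(fake_outputs):
    """Adversarial generator loss of Eq. (2) for one example."""
    n_disc = len(fake_outputs)
    return sum(np.mean(np.maximum(0.0, 1.0 - d)) for d in fake_outputs) / n_disc
```

With this form, the discriminator term vanishes once real outputs exceed 1 and generated outputs fall below -1, while the generator term pushes \(D_{k,t}(G(x))\) above 1.
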
\[ \mathcal{L}_\mathcal{G}^{feat}= \mathbb{E}_x \left[ \frac{1}{K} \sum_{\substack{k \in [0,3] \\ l \in [1,L_k [ }} \frac{1}{T_{k,l}F_{k,l}} \sum_t \frac{\left\| D_{k,t}^{(l)}(y)-D_{k,t}^{(l)}(G(x)) \right\|_{L_1}}{ \mathrm{mean}(D_{k,t}^{(l)}(G(x)))}\right] \] (3) Generator feature loss
\[ \mathcal{L}_\mathcal{G}^{spec} = \mathbb{E}_{x, y} \left[ \sum_{\substack{f_r \in \{512, 1024, 2048\} \\ h_r \in \{50, 120, 240\} \\ w_r \in \{240, 600, 1200\}}} \left\| \Psi\left(\mathrm{STFT}_{f_r, h_r, w_r}(y)\right) - \Psi\left(\mathrm{STFT}_{f_r, h_r, w_r}(G(x))\right) \right\|_{L_1} \right] \] (4) Generator spectral loss
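
The multi-resolution structure of Eq. (4) can be sketched as follows. Several points are assumptions made for illustration only: the three sets are read as paired triples \((f_r, h_r, w_r)\), \(\Psi\) is taken to be a log-magnitude map (it is not defined in this section), a Hann window is used, and the \(L_1\) norm is averaged over time-frequency bins rather than summed.

```python
import numpy as np

# Assumed pairing of the sets in Eq. (4): (FFT size, hop size, window length)
RESOLUTIONS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]


def stft_magnitude(signal, n_fft, hop, win_length):
    """Magnitude STFT with a Hann window, zero-padding each frame to n_fft."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))


def psi(magnitude, eps=1e-7):
    """Hypothetical stand-in for Psi: log-magnitude compression."""
    return np.log(magnitude + eps)


def spectral_loss(y, y_hat):
    """Sum over resolutions of the mean absolute difference of Psi(STFT(.))."""
    total = 0.0
    for n_fft, hop, win_length in RESOLUTIONS:
        ref = psi(stft_magnitude(y, n_fft, hop, win_length))
        est = psi(stft_magnitude(y_hat, n_fft, hop, win_length))
        total += np.mean(np.abs(ref - est))
    return total
```

Both signals must be at least as long as the largest window (1200 samples here). Short windows favor temporal precision and long windows favor frequency precision, which is the usual motivation for combining several resolutions.
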
| Speech | eSTOI | Noresqa-MOS |
|---|---|---|
| Simulated In-ear | 0.83 | 2.57 |
| Audio U-net | 0.87 | 2.59 |
| Hifi-GAN v3 | 0.78 | 3.70 |
| Streaming Seanet | 0.89 | 3.91 |
| Seanet | 0.89 | 4.25 |
| EBEN (ours) | 0.89 | 4.02 |

| Speech | \(P_{gen}\) | \(P_{dis}\) | \(\tau\) (ms) | \(\delta\) (MB) |
|---|---|---|---|---|
| Audio U-net | 71.0 M | \(\emptyset\) | 37.5 | 1117.3 |
| Hifi-GAN v3 | 1.5 M | 70.7 M | 3.1 | 22.2 |
| Streaming Seanet | 0.7 M | 56.6 M | 7.5 | 10.9 |
| Seanet | 8.3 M | 56.6 M | 13.1 | 89.2 |
| EBEN (ours) | 1.9 M | 27.8 M | 4.3 | 20 |
| Test set | eSTOI | Noresqa-MOS |
|---|---|---|
| In-domain (simulation) | 0.89 | 4.02 |
| Real signals | 0.51 | 3.82 |

| Subset | Weighting | Leq [dB] | Lmin [dB] | Lmax [dB] |
|---|---|---|---|---|
| speech-noisy (3 hours) | Linear | 94.1 | 63.3 | 108.7 |
| | A | 90.8 | 57.7 | 106.6 |
| speechless-noisy (8 hours) | Linear | 89.6 | 31.9 | 103.2 |
| | A | 88.0 | 17.9 | 104.4 |
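
For readers unfamiliar with the quantities in this table, the sketch below shows how an unweighted ("Linear") equivalent level Leq and the extreme short-term levels Lmin and Lmax can be computed from a calibrated pressure signal in pascals. It is a minimal sketch under stated assumptions: the frame length used for the short-term levels is a free parameter here (the averaging time used for the table is not given), and A-weighting, which requires an additional filter, is not shown.

```python
import numpy as np

P_REF = 20e-6  # standard reference pressure in air, in pascals


def level_db(pressure):
    """Sound pressure level in dB re 20 uPa from the mean squared pressure."""
    return 10.0 * np.log10(np.mean(np.square(pressure)) / P_REF**2)


def leq_lmin_lmax(pressure, frame_len):
    """Return (Leq, Lmin, Lmax) using non-overlapping frames of frame_len samples.

    frame_len is an assumed analysis parameter, not a value from the table.
    """
    n_frames = len(pressure) // frame_len
    frames = pressure[: n_frames * frame_len].reshape(n_frames, frame_len)
    short_term = 10.0 * np.log10(np.mean(frames**2, axis=1) / P_REF**2)
    return level_db(pressure), short_term.min(), short_term.max()
```

Because Leq averages energy rather than decibels, it sits closer to Lmax than to Lmin whenever loud segments are present. A frame of exact digital silence would produce a level of minus infinity, so real recordings are usually assumed to contain some noise floor.
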

speech-clean

| Sensor | Configuration | eSTOI | Noresqa-MOS |
|---|---|---|---|
| Forehead | Raw signal | 0.731 | 3.760 |
| | EBEN (M=4, P=4, Q=4) | 0.855 | 4.250 |
| Soft In-ear | Raw signal | 0.752 | 3.315 |
| | EBEN (M=4, P=2, Q=4) | 0.868 | 4.331 |
| Rigid In-ear | Raw signal | 0.782 | 3.392 |
| | EBEN (M=4, P=2, Q=4) | 0.877 | 4.285 |
| Throat | Raw signal | 0.677 | 3.097 |
| | EBEN (M=4, P=2, Q=4) | 0.834 | 3.862 |
| Temple | Raw signal | 0.602 | 2.905 |
| | EBEN (M=4, P=1, Q=4) | 0.763 | 3.632 |

speech-noisy

| Sensor | Initialization | Squim-eSTOI | Noresqa-MOS |
|---|---|---|---|
| Forehead | Raw signal | 0.901 | 3.85 |
| | Tested from pretrained* | 0.949 | 4.08 |
| | Trained† from scratch | 0.949 | 4.14 |
| | Trained† from pretrained* | 0.971 | 4.20 |
| Rigid In-ear | Raw signal | 0.751 | 3.47 |
| | Tested from pretrained* | 0.812 | 3.73 |
| | Trained† from scratch | 0.876 | 3.51 |
| | Trained† from pretrained* | 0.873 | 3.80 |
| Throat | Raw signal | 0.942 | 3.71 |
| | Tested from pretrained* | 0.969 | 4.05 |
| | Trained† from scratch | 0.978 | 3.87 |
| | Trained† from pretrained* | 0.971 | 3.98 |

speech-clean
speech-clean + speechless-noisy
speech-noisy samples
Noisy headset
Headset enhanced (Sepformer)
Noisy Throat
Throat enhanced (EBEN)
| Metric | Pearson Correlation Coefficient |
|---|---|
| Intelligibility | |
| STOI | 0.52 |
| ABC-MRT | 0.57 |
| 1 - PER | 0.45 |
| Quality | |
| STOI | 0.87 |
| PESQ | 0.81 |
| N-MOS | 0.76 |
| Identity | |
| ECAPA2 | 0.90 |
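
The coefficients in this table are Pearson correlations. As a reminder of what that number measures, here is a minimal sketch that correlates an objective metric with a reference score over the same set of items. What the reference score is in this table is not stated in this section, so the inputs below are placeholders and the function name is an assumption.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))


# Toy values for illustration only, not data from the study
metric_scores = [0.62, 0.71, 0.80, 0.85, 0.91]
reference_scores = [2.1, 2.9, 3.2, 3.9, 4.4]
r = pearson(metric_scores, reference_scores)
```

The coefficient is invariant to affine rescaling of either variable, so it captures how well a metric tracks the reference linearly rather than whether the two agree in absolute value.
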

speech-clean for the throat microphone: audio samples (Signal 1 to Signal 4) for Corrupted, Reference, EBEN, Mimi finetuning, and NeMo FlowMatching (Ku et al., ICASSP 2025).

| Approach | Parameters | eSTOI | Noresqa-MOS | PER | SI-SDR |
|---|---|---|---|---|---|
| Raw Throat | – | 0.67 | 3.10 | 50.8% | –8.0 |
| EBEN | 1.9M | 0.834 | 3.86 | 18.6% | 3.2 |
| Mimi finetuning | 96.2M | 0.841 | 4.11 | 15.1% | 1.37 |
| NeMo FlowMatching (Ku et al., ICASSP 2025) | 430M | 0.822 | 4.39 | 7.6% | 7.3 |
| Measure | Frequency resolution | No Linear assumption | Formula |
|---|---|---|---|
| Coherence | ✅ | ✅ | \(\displaystyle \gamma^2_{xy}(f) = \frac{|S_{xy}(f)|^2}{S_{xx}(f)S_{yy}(f)}\) |
| Transfer function | ✅ | ❌ | \(\displaystyle H(f) = \frac{Y(f)}{X(f)}\) |
| Concordance coefficient | ❌ | ✅ | \(\displaystyle \rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2+\sigma_y^2+(\mu_x-\mu_y)^2}\) |
| Filterbank | Real-valued | Short filter length | No redundancy | No perceptual bias | Easy manipulation |
|---|---|---|---|---|---|
| PQMF (critically sampled) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Paraunitary FIR | ✅ | ❌ | ✅ | ✅ | ❌ |
| Oversampled FB | ✅ | ❌ | ❌ | ✅ | ❌ |
| Non-uniform FB | ✅ | ❌ | ❌ | ❌ | ❌ |
| DFT-modulated | ❌ | ✅ | ✅ | ✅ | ❌ |
| Speech | PESQ | SI-SDR | eSTOI | Noresqa-MOS |
|---|---|---|---|---|
| Simulated In-ear | 2.42 | 8.4 | 0.83 | 2.57 |
| Audio U-net | 2.24 | 11.9 | 0.87 | 2.59 |
| Hifi-GAN v3 | 1.32 | -25.1 | 0.78 | 3.70 |
| Streaming Seanet | 2.01 | 11.2 | 0.89 | 3.91 |
| Seanet | 1.92 | 11.1 | 0.89 | 4.25 |
| EBEN (ours) | 2.08 | 10.9 | 0.89 | 4.02 |

Noise level figures: silence (mean); speech-noisy (mean, min, max); speechless-noisy (mean, min, max).
speech-clean
Forehead
Soft in-ear
Rigid in-ear
Throat
Temple
speechless-clean
Headset
Forehead
Soft in-ear
Rigid in-ear
Throat
Temple
speech-clean
Headset
Forehead
Soft in-ear
Rigid in-ear
Throat
Temple
speech-noisy
Headset
Forehead
Soft in-ear
Rigid in-ear
Throat
Temple
speechless-noisy
Headset
Forehead
Soft in-ear
Rigid in-ear
Throat
Temple
librispeech-test-clean