Deep speech inpainting of time-frequency masks

This website accompanies the Interspeech 2020 paper of the same title.
Mikolaj Kegler*,1,2, Pierre Beckmann*,1,3, Milos Cernak1
1 - Logitech Europe S.A., Lausanne, Switzerland
2 - Centre for Neurotechnology & Department of Bioengineering, Imperial College London (ICL), UK
3 - École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
* - equal contribution

Deep Speech Inpainting


Fig.1 Deep speech inpainting framework & training. Learn more about our deep speech feature extractor below.

We developed a deep learning framework for speech inpainting: the context-based recovery of large portions of missing or severely degraded time-frequency representations of speech. The network is based on the U-Net architecture and trained with a deep feature loss to recover the distorted input. The feature loss was computed using speechVGG, a deep speech feature extractor pre-trained on an auxiliary word recognition task. This allowed us to emphasize speech-specific features during training and ultimately led to the best performance. The framework can operate both when the mask distorting the input is known (informed inpainting) and when it is not (blind inpainting). In the latter case, the network simultaneously identifies and recovers the distorted portions of the time-frequency representation of speech. For more details and the full evaluation, see the original paper.
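The deep feature loss described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: `toy_feature_extractor` is a stand-in for the pre-trained speechVGG network (here just fixed average-pooling stages producing hierarchical "feature maps"), and the loss is the mean L1 distance between the feature maps of the network output and the clean target.

```python
import numpy as np

def toy_feature_extractor(spec, n_layers=3):
    """Hypothetical stand-in for speechVGG: fixed 2x2 average-pooling
    stages yielding increasingly coarse 'feature maps' of a spectrogram."""
    feats, x = [], spec
    for _ in range(n_layers):
        h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # crop to even size
        x = x[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        feats.append(x)
    return feats

def deep_feature_loss(spec_pred, spec_true):
    """Mean L1 distance between feature maps of prediction and target,
    summed over the extractor's layers."""
    return sum(np.abs(a - b).mean()
               for a, b in zip(toy_feature_extractor(spec_pred),
                               toy_feature_extractor(spec_true)))

rng = np.random.default_rng(0)
clean = rng.standard_normal((128, 128))       # target log-spectrogram
degraded = clean + 0.1 * rng.standard_normal((128, 128))
loss = deep_feature_loss(degraded, clean)     # > 0 for a mismatched pair
```

In the actual framework the extractor is a pre-trained deep network, so matching its activations pushes the inpainting model toward outputs that are perceptually and linguistically close to clean speech, rather than merely close in raw spectrogram values.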


We prepared over 30 interactive demo samples of our speech inpainting framework in action. The speech samples were obtained from the LibriSpeech dataset used in our original paper. Below you can find a speech sample corrupted by removing a part of its time-frequency representation according to a random mask (A). Next, we present the result of processing the degraded sample through our speech inpainting framework (B). Finally, the last sample is the original recording (C). Click "Next demo" at the bottom to go to the next sample. Consecutive sets of audio samples are unrelated and presented in randomized order. Give it a listen!
A) Corrupted speech sample - missing time & frequency information
B) Inpainted (i.e. recovered) speech sample
C) Original speech sample
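The corruption applied in (A) can be sketched as masking out a region of a log-spectrogram. The snippet below is a simplified illustration, assuming a random rectangular mask over a toy spectrogram; the paper uses a variety of mask shapes and sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
log_spec = rng.standard_normal((80, 200))  # toy (freq bins, time frames)

# Random rectangular time-frequency mask: 1 = keep, 0 = removed.
mask = np.ones_like(log_spec)
f0 = rng.integers(0, 40)   # random start in frequency
t0 = rng.integers(0, 100)  # random start in time
mask[f0:f0 + 20, t0:t0 + 50] = 0.0

corrupted = log_spec * mask  # the degraded input, as in sample (A)
```

In the informed setting the network receives both `corrupted` and `mask`; in the blind setting it receives only `corrupted` and must locate the missing region itself.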




Fig.2 speechVGG, a flexible, transferable feature extractor for speech processing.

speechVGG is a deep speech feature extractor, tailored specifically for representation and transfer learning in speech processing problems. The extractor adopts the classic VGG-16 architecture and is trained on a word recognition task. We showed that the extractor captures generalized speech-specific features in a hierarchical fashion. Importantly, the generalized representation of speech captured by the pre-trained model transfers across distinct speech processing tasks, even when they employ different datasets. In our experiments, even relatively simple applications of the pre-trained speechVGG achieved results comparable to the state of the art, presumably thanks to this knowledge transfer. For more details and the full evaluation, see the original paper. A Python implementation of speechVGG and models pre-trained on the LibriSpeech dataset are openly available at: