Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer. In particular, readily available models pre-trained on large datasets are key for the efficient transfer of knowledge. They can be applied as feature extractors for data preprocessing, fine-tuned to perform a variety of tasks, or used for computing feature losses in the training of deep learning systems. While applications of transfer learning are common in the fields of computer vision and natural language processing, audio- and speech processing are surprisingly lacking readily available and transferable models. Here, we introduce speechVGG, a flexible, transferable feature extractor tailored for integration with deep learning frameworks for speech processing. Our transferable model adopts the classic VGG-16 architecture and is trained on a spoken word classification task. We demonstrate the application of the pre-trained model in four speech processing tasks, including speech enhancement, language identification, speech, noise and music classification, and speaker identification. Each time, we compare the performance of our approach to existing baselines. Our results confirm that the representation of natural speech captured using speechVGG is transferable and generalizable across various speech processing problems and datasets. Notably, relatively simple applications of our pre-trained model are capable of achieving competitive results.
Submitted to InterSpeech2020.