Increasing speech intelligibility for hearing-impaired listeners and normal-hearing listeners in noisy environments remains a challenging problem. Spectral style conversion from habitual to clear speech is a promising approach to address the problem. Motivated by the success of generative adversarial networks (GANs) in various applications of image and speech processing, we explore the potential of conditional GANs (cGANs) to learn the mapping from habitual speech to clear speech. We evaluated the performance of cGANs in three tasks: 1) speaker-dependent one-to-one mappings, 2) speaker-independent many-to-one mappings, and 3) speaker-independent many-to-many mappings. In the first task, cGANs outperformed a traditional deep neural network mapping in terms of average keyword recall accuracy and the number of speakers with improved intelligibility. In the second task, we significantly improved the intelligibility of one of three speakers, without any source-speaker training data. In the third and most challenging task, we improved keyword recall accuracy for two of three speakers, but without statistical significance.
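For readers who want a concrete picture of the conditional GAN setup described above, here is a minimal PyTorch sketch of frame-level spectral mapping from habitual to clear speech. Everything in it is illustrative: the feature dimension, layer sizes, and names (`Generator`, `Discriminator`, `SPECTRAL_DIM`, `train_step`) are assumptions for this sketch, not the architecture or features used in the paper.

```python
import torch
import torch.nn as nn

# Assumed frame-level spectral feature size (e.g., mel-cepstral
# coefficients); the paper's actual features may differ.
SPECTRAL_DIM = 40

class Generator(nn.Module):
    """Maps a habitual-speech spectral frame to a clear-speech frame."""
    def __init__(self, dim=SPECTRAL_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, habitual):
        return self.net(habitual)

class Discriminator(nn.Module):
    """Scores (habitual, clear) frame pairs as real or generated."""
    def __init__(self, dim=SPECTRAL_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, habitual, clear):
        # Conditioning: concatenate the habitual frame with the
        # (real or generated) clear frame before scoring.
        return self.net(torch.cat([habitual, clear], dim=-1))

def train_step(G, D, opt_g, opt_d, habitual, clear, bce=nn.BCELoss()):
    """One adversarial update on a minibatch of time-aligned frame pairs."""
    real = torch.ones(habitual.size(0), 1)
    fake = torch.zeros(habitual.size(0), 1)

    # Discriminator: real pairs vs. generated pairs (generator frozen).
    opt_d.zero_grad()
    d_loss = bce(D(habitual, clear), real) + \
             bce(D(habitual, G(habitual).detach()), fake)
    d_loss.backward()
    opt_d.step()

    # Generator: produce clear frames the discriminator accepts as real.
    opt_g.zero_grad()
    g_loss = bce(D(habitual, G(habitual)), real)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Because the discriminator scores (habitual, clear) pairs rather than clear frames alone, the generator is rewarded for output that is consistent with the conditioning habitual frame, not merely realistic in isolation; this is what makes the mapping conditional.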
Speech samples were mixed with babble noise at 0 dB SNR.
Speaker | Vocoded Habitual Speech | DNN | GAN | Oracle | Vocoded Clear Speech |
---|---|---|---|---|---|
C_M7: control male | | | | | |
PD_M6: Parkinson disease male | | | | | |
We convert PD_M6 to C_M7. Speech samples were mixed with babble noise at 0 dB SNR.
PD_M6: Vocoded Habitual Speech | GAN | Oracle | C_M10: Vocoded Clear Speech |
---|---|---|---|
 | | | |
Speech samples were mixed with babble noise at 0 dB SNR.
Speaker | Vocoded Habitual Speech | GAN | Oracle | Vocoded Clear Speech |
---|---|---|---|---|
C_M7: control male | | | | |