Improving Speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion

Tuan Dinh, Alexander Kain, Kris Tjaden

Increasing speech intelligibility for hearing-impaired listeners and normal-hearing listeners in noisy environments remains a challenging problem. Spectral style conversion from habitual to clear speech is a promising approach to address the problem. Motivated by the success of generative adversarial networks (GANs) in various applications of image and speech processing, we explore the potential of conditional GANs (cGANs) to learn the mapping from habitual speech to clear speech. We evaluated the performance of cGANs in three tasks: 1) speaker-dependent one-to-one mappings, 2) speaker-independent many-to-one mappings, and 3) speaker-independent many-to-many mappings. In the first task, cGANs outperformed a traditional deep neural network mapping in terms of average keyword recall accuracy and the number of speakers with improved intelligibility. In the second task, we significantly improved the intelligibility of one of three speakers, without any source speaker training data. In the third and most challenging task, we improved keyword recall accuracy for two of three speakers, but without statistical significance.
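
The page does not spell out the model architecture, so the following is only a minimal, pix2pix-style sketch of a cGAN spectral mapper in PyTorch. The feature dimension FEAT_DIM, layer sizes, losses, and the L1 weight of 100 are illustrative assumptions, not the paper's configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 25  # hypothetical spectral feature order; not given on this page

class Generator(nn.Module):
    # Maps a frame of habitual-speech spectral features to a clear-speech frame.
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    # Conditional: scores a (habitual, clear) pair, so it judges the mapping itself.
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))
    def forward(self, habitual, clear):
        return self.net(torch.cat([habitual, clear], dim=-1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(habitual, clear):
    # habitual, clear: (batch, FEAT_DIM) time-aligned feature frames
    fake = G(habitual)
    # Discriminator update: real pairs -> 1, generated pairs -> 0
    d_real, d_fake = D(habitual, clear), D(habitual, fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator update: fool the discriminator, plus an L1 term toward the clear target
    d_fake = D(habitual, fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + 100.0 * F.l1_loss(fake, clear)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

The conditional discriminator sees the habitual input alongside the clear output, so it penalizes outputs that sound clear but do not correspond to the input utterance.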

Download paper

One-to-one style conversion

Speech samples were mixed with babble noise at 0 dB SNR.

[Audio samples: for each speaker, one clip per condition, Vocoded Habitual Speech | DNN | GAN | Oracle | Vocoded Clear Speech]
C_M7: control male
PD_M6: male with Parkinson's disease
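
Every sample table on this page uses the same 0 dB SNR babble mixing. A minimal sketch of that mixing follows; the function name and the power-based SNR definition are ours, not the paper's:

import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db, then add.
    noise = noise[:len(speech)]  # assumes the noise recording is at least as long
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

At 0 dB SNR the scaled babble carries the same average power as the speech, which makes the habitual condition especially hard to understand.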

Many-to-one style conversion

Here we convert the habitual speech of PD_M6 toward the clear style of C_M7. Speech samples were mixed with babble noise at 0 dB SNR.

[Audio samples: PD_M6 Vocoded Habitual Speech | GAN | Oracle | C_M10 Vocoded Clear Speech]
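
The "vocoded" conditions above come from an analysis-resynthesis chain. The page does not name the vocoder used, so purely as an illustration, here is that chain with WORLD via the pyworld package; file names are hypothetical:

import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("habitual.wav")  # hypothetical input file
f0, sp, ap = pw.wav2world(x.astype(np.float64), fs)  # F0, spectral envelope, aperiodicity
# A spectral style-conversion model would modify `sp` (the spectral envelope) here;
# resynthesizing without modification yields the vocoded baseline conditions.
y = pw.synthesize(f0, sp, ap, fs)
sf.write("vocoded.wav", y, fs)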

Many-to-many style conversion

Speech samples were mixed with babble noise at 0 dB SNR.

[Audio samples: for each speaker, Vocoded Habitual Speech | GAN | Oracle | Vocoded Clear Speech]
C_M7: control male