Towards Duration Style Conversion
Tuan Dinh
We present preliminary results on improving speech intelligibility using duration conversion. We use a phase vocoder to convert the duration of habitual speech to that of slow speech, either uniformly or non-uniformly. Uniform conversion applies a single sentence-level scaling factor to the habitual speech; non-uniform conversion applies multiple phoneme-level scaling factors. We assume these scaling factors are available, and for non-uniform conversion we also assume that phoneme labels and boundaries are available. The goal is to determine whether non-uniform modification is better than uniform conversion. We find, however, that non-uniform conversion is not better than uniform conversion.
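The core operation is phase-vocoder time-scale modification. The following is a minimal sketch of the two conversion modes, assuming librosa's phase-vocoder-based time_stretch and that the scaling factors and phoneme boundaries are supplied externally; function and variable names are illustrative, not the paper's implementation.

```python
# Minimal sketch of uniform vs. non-uniform duration conversion with a phase
# vocoder (via librosa). Scaling factors and phoneme boundaries are assumed
# to be given, as stated in the abstract.
import numpy as np
import librosa

def uniform_convert(y, sr, factor):
    """Stretch the whole utterance by one sentence-level factor (>1 slows it down)."""
    # librosa's time_stretch speeds up for rate > 1, so invert the duration factor.
    return librosa.effects.time_stretch(y, rate=1.0 / factor)

def nonuniform_convert(y, sr, boundaries, factors):
    """Stretch each phoneme segment by its own factor.

    boundaries: list of (start_sec, end_sec) per phoneme
    factors:    phoneme-level duration scaling factors (same length)
    """
    segments = []
    for (start, end), factor in zip(boundaries, factors):
        seg = y[int(start * sr):int(end * sr)]
        if len(seg) > 0:
            segments.append(librosa.effects.time_stretch(seg, rate=1.0 / factor))
    return np.concatenate(segments)
```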
Improving Speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
Increasing speech intelligibility for hearing-impaired listeners and normal-hearing listeners in noisy environments remains a challenging problem. Spectral style conversion from habitual to clear speech is a promising approach to address the problem. Motivated by the success of generative adversarial networks (GANs) in various applications of image and speech processing, we explore the potential of conditional GANs (cGANs) to learn the mapping from habitual speech to clear speech. We evaluated the performance of cGANs in three tasks: 1) speaker-dependent one-to-one mappings, 2) speaker-independent many-to-one mappings, and 3) speaker-independent many-to-many mappings. In the first task, cGANs outperformed a traditional deep neural network mapping in terms of average keyword recall accuracy and the number of speakers with improved intelligibility. In the second task, we significantly improved intelligibility of one of three speakers, without any source speaker training data. In the third and most challenging task, we improved keyword recall accuracy for two of three speakers, but without statistical significance.
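As a rough illustration of the cGAN setup, the PyTorch sketch below pairs a frame-level generator (habitual to clear spectral features) with a discriminator conditioned on the habitual input. The feature dimension, layer sizes, and adversarial loss are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch of a conditional GAN for frame-level spectral mapping
# (habitual -> clear). Dimensions and layer sizes are assumed, not the paper's.
import torch
import torch.nn as nn

FEAT_DIM = 40  # e.g., spectral coefficients per frame (assumed)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )
    def forward(self, habitual):
        return self.net(habitual)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Conditioned on the habitual input: judges (habitual, candidate-clear) pairs.
        self.net = nn.Sequential(
            nn.Linear(2 * FEAT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
    def forward(self, habitual, clear):
        return self.net(torch.cat([habitual, clear], dim=-1))

# One adversarial step (sketch) on a batch of paired frames.
G, D = Generator(), Discriminator()
bce = nn.BCELoss()
habitual = torch.randn(8, FEAT_DIM)   # habitual frames
clear = torch.randn(8, FEAT_DIM)      # paired clear frames
fake = G(habitual)
d_loss = bce(D(habitual, clear), torch.ones(8, 1)) + \
         bce(D(habitual, fake.detach()), torch.zeros(8, 1))
g_loss = bce(D(habitual, fake), torch.ones(8, 1))
```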
Increasing the Intelligibility and Naturalness of Alaryngeal Speech Using Voice Conversion and Synthetic Fundamental Frequency
Tuan Dinh, Alexander Kain, Robin Samlan, Beiming Cao, Jun Wang
Individuals who undergo a laryngectomy lose their ability to phonate. Although current treatment options allow alaryngeal speech, these individuals struggle in their daily communication and social life due to the low intelligibility of their speech. In this paper, we presented two conversion methods for increasing the intelligibility and naturalness of speech produced by laryngectomees (LAR). The first method used a deep neural network for predicting binary voicing/unvoicing or the degree of aperiodicity. The second method used a conditional generative adversarial network to learn the mapping from LAR speech spectra to clearly-articulated speech spectra. We also created a synthetic fundamental frequency trajectory with an intonation model consisting of phrase and accent curves. For the two conversion methods, we showed that adaptation always increased the performance of pre-trained models, objectively. In subjective testing involving four LAR speakers, we significantly improved the naturalness of two speakers, and we also significantly improved the intelligibility of one speaker.
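To illustrate the synthetic fundamental frequency idea, the sketch below superimposes a declining phrase curve and Gaussian-shaped accent curves in the log-F0 domain. All parameter values and accent shapes are illustrative assumptions, not those used in the paper.

```python
# Minimal sketch of a superpositional intonation model: a declining phrase
# component plus accent bumps, summed in the log-F0 domain. Values are
# illustrative only.
import numpy as np

def synthetic_f0(duration_s, accents, base_hz=110.0, phrase_decline=0.2, fs=100):
    """Return a synthetic F0 contour sampled at `fs` frames per second.

    accents: list of (center_sec, amplitude, width_sec) accent curves
    """
    t = np.arange(0, duration_s, 1.0 / fs)
    log_f0 = np.log(base_hz) - phrase_decline * (t / duration_s)      # phrase curve
    for center, amp, width in accents:
        log_f0 += amp * np.exp(-0.5 * ((t - center) / width) ** 2)    # accent curve
    return np.exp(log_f0)

# Example: a 2-second utterance with two accents.
f0 = synthetic_f0(2.0, accents=[(0.5, 0.15, 0.12), (1.4, 0.10, 0.12)])
```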
Using a Manifold Vocoder for Spectral Voice and Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we showed that using a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 resulted in significantly better perceived speech quality as compared to 40-dimensional MCEPs, with similar speaker accuracy. In habitual to clear style conversion experiments, we significantly improved the speech intelligibility for one of three speakers, using a custom skip connection deep neural network, with the average keyword recall accuracy increasing from 24% to 46%.
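As an illustration of the VAE-12 idea, the sketch below compresses high-resolution spectral frames into a 12-dimensional latent feature with a speaker-independent variational autoencoder. The spectral dimensionality, hidden sizes, and loss weighting are assumptions, not the paper's configuration.

```python
# Sketch of a VAE that maps high-resolution spectra onto a low-dimensional
# latent manifold (12-dimensional, as in VAE-12). Sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEC_DIM, LATENT_DIM = 513, 12  # e.g., log-magnitude bins per frame (assumed)

class SpectralVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(SPEC_DIM, 256), nn.Tanh())
        self.mu = nn.Linear(256, LATENT_DIM)
        self.logvar = nn.Linear(256, LATENT_DIM)
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.Tanh(),
                                 nn.Linear(256, SPEC_DIM))

    def forward(self, spectrum):
        h = self.enc(spectrum)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(recon, target, mu, logvar):
    # Reconstruction term plus KL divergence to the standard-normal prior.
    rec = F.mse_loss(recon, target, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```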