We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, a 12-dimensional VAE feature (VAE-12) resulted in significantly better perceived speech quality than a 12-dimensional MCEP feature. In voice conversion experiments, VAE-12 resulted in significantly better perceived speech quality than 40-dimensional MCEPs, with similar speaker accuracy. In habitual-to-clear style conversion experiments, using a custom skip-connection deep neural network, we significantly improved speech intelligibility for one of three speakers, with the average keyword recall accuracy increasing from 24% to 46%.
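To make "interpolable" and "suited for averaging" concrete, the following is a minimal sketch of the generic VAE ingredients that operate on a 12-dimensional latent code. It is not the network used in this work: the encoder and decoder are omitted, and the function names (`reparameterize`, `kl_to_standard_normal`, `interpolate`, `average`) are hypothetical, chosen only for illustration.

```python
import math
import random

LATENT_DIM = 12  # matches the VAE-12 feature size; everything else here is assumed


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def interpolate(z_a, z_b, alpha):
    """Linear blend of two latent codes: alpha=0 returns z_a, alpha=1 returns z_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def average(codes):
    """Element-wise mean of several latent codes (what an averaging regressor produces)."""
    n = len(codes)
    return [sum(code[i] for code in codes) / n for i in range(len(codes[0]))]


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [0.0] * LATENT_DIM
    log_var = [0.0] * LATENT_DIM
    z_a = reparameterize(mu, log_var, rng)
    z_b = reparameterize(mu, log_var, rng)
    z_mid = interpolate(z_a, z_b, 0.5)  # a point halfway between the two codes
    print(len(z_mid), kl_to_standard_normal(mu, log_var))
```

The KL term pulls the latent distribution toward a standard normal, which is one common reason a blend or mean of two valid codes tends to stay in a region the decoder can handle. Whether that holds for a given model is an empirical question.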
Evaluate the efficacy of VAE-12 in vocoding
Speaker | LSF vocoding | MCEP vocoding | VAE vocoding | Natural Speech |
---|---|---|---|---|
SF1: female | | | | |
SM2: male | | | | |
Evaluate the efficacy of VAE-12 in voice conversion
Source | LSF mapping | MCEP-12 mapping | MCEP-40 mapping | VAE mapping | Target |
---|---|---|---|---|---|
Evaluate the efficacy of VAE-12 in style conversion for intelligibility improvement
Habitual speech | DNN mapping | proposed DNN mapping | Oracle | Clear Speech |
---|---|---|---|---|