Demo page

Content-Aware Style Augmentation for Zero-shot Voice Conversion with Short Target Speech

Authors

Hyeonjin Cha, Seyun Um, Miseul Kim, Changhwan Kim, Seungshin Lee and Hong-Goo Kang

Proposed Method

inference image

In this letter, we propose a neural zero-shot voice conversion (ZS-VC) system that achieves both high speaker similarity and high speech intelligibility by incorporating a content-aware style generation module. Recent neural ZS-VC systems perform strongly on either speaker similarity or speech intelligibility, but attaining both remains challenging, especially when only a short target speech sample is available. We attribute this limitation to the insufficient content problem: the linguistic content of the target speech fails to fully cover that of the source speech. To address this issue, we introduce a method that uses self-supervised feature generation to augment the target speaker’s style features for underrepresented content. Experimental results demonstrate that the proposed system, when integrated with the feature matching-based approach kNN-VC, outperforms existing methods in both speaker similarity and speech intelligibility.
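To make the insufficient content problem concrete, the following is a minimal sketch of the frame-level feature matching used by kNN-VC-style systems: each source frame is replaced by the average of its k nearest target frames under cosine similarity. The function name `knn_match`, the default `k=4`, and the array shapes are illustrative assumptions, not the authors' implementation, and the sketch does not reproduce the proposed style generation module.

```python
import numpy as np


def knn_match(source_feats, target_pool, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (n_src, d) self-supervised features of the source utterance.
    target_pool:  (n_tgt, d) features extracted from the target speech.
    Names, shapes, and the default k are assumptions made for illustration.
    """
    source_feats = np.asarray(source_feats, dtype=float)
    target_pool = np.asarray(target_pool, dtype=float)

    # Cosine similarity between every source frame and every target frame.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape (n_src, n_tgt)

    # A short target utterance may contain fewer than k frames.
    k = min(k, target_pool.shape[0])

    # Indices of the k most similar target frames for each source frame.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    # Average the selected target frames: shape (n_src, d).
    return target_pool[idx].mean(axis=1)
```

With a 2-second target, `target_pool` holds few frames, so a source frame whose content never occurs in the target can only be matched to acoustically distant neighbors. Under the assumption that augmentation enlarges this pool, generated style features would be concatenated to `target_pool` along the first axis before matching; how those features are produced is described in the paper, not here.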

Samples

Baselines: kNN-VC, Phoneme Hallucinator (PH)

Female to female

Source Target (2 sec) kNN-VC PH Proposed

Male to female

Source Target (2 sec) kNN-VC PH Proposed

Female to male

Source Target (2 sec) kNN-VC PH Proposed

Male to male

Source Target (2 sec) kNN-VC PH Proposed