Demo page

Content-Aware Style Augmentation for Zero-shot Voice Conversion with Short Target Speech

Authors

Hyeonjin Cha, Seyun Um, Miseul Kim, Changhwan Kim, Seungshin Lee and Hong-Goo Kang

Proposed Method

[Figure: inference diagram]

In this letter, we propose a neural zero-shot voice conversion (ZS-VC) system that achieves high speaker similarity and high speech intelligibility at the same time by incorporating a content-aware style generation module. Although recent neural ZS-VC systems perform strongly in either speaker similarity or speech intelligibility, attaining high performance in both remains challenging, especially when only a short target speech sample is available. We attribute this limitation to the insufficient content problem, in which the linguistic content of the target speech fails to fully cover that of the source speech. To address this issue, we introduce a method that uses self-supervised feature generation to augment the target speaker’s style features for underrepresented content. Experimental results demonstrate that the proposed system, when integrated with the feature matching-based approach kNN-VC, outperforms existing methods in both speaker similarity and speech intelligibility.
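The insufficient content problem can be illustrated with a toy version of feature matching. The sketch below is not the authors' implementation and not the kNN-VC code. It assumes that each source frame is replaced by the uniform average of its k nearest target frames under cosine similarity. The function names, the 2-dimensional toy features, and the default k are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def knn_convert(source_frames, target_pool, k=4):
    """Replace each source frame with the mean of its k most similar
    target frames (hypothetical helper; k=4 is an illustrative default)."""
    if not target_pool:
        raise ValueError("target_pool must contain at least one frame")
    k = min(k, len(target_pool))
    converted = []
    for frame in source_frames:
        # Rank every target frame by similarity to the current source frame.
        ranked = sorted(
            target_pool,
            key=lambda t: cosine_similarity(frame, t),
            reverse=True,
        )
        neighbours = ranked[:k]
        # Uniform average of the selected neighbours, dimension by dimension.
        converted.append(
            [sum(n[d] for n in neighbours) / k for d in range(len(frame))]
        )
    return converted


# Toy data: the source contains two distinct "contents", but the short
# target pool only covers the first one.
source = [[1.0, 0.0], [0.0, 1.0]]
short_target_pool = [[0.9, 0.1], [0.8, 0.2]]

print(knn_convert(source, short_target_pool, k=1))
# prints [[0.9, 0.1], [0.8, 0.2]]
```

In this toy case the second source frame has no close counterpart in the pool, so it is mapped to a target frame that mostly represents the other content. A short target sample can leave gaps of this kind, and the proposed style augmentation is meant to fill them with generated features.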

Samples

Baselines: kNN-VC, Phoneme Hallucinator (PH)

Female to female

Source Target Source transcript
What a sweet voiced little innocent it was to be sure.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed
Source Target Source transcript
But this subject will be more properly discussed when we treat of the different races of mankind.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed

Male to female

Source Target Source transcript
He turned away from her suddenly and set off across the strand.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed
Source Target Source transcript
This was at the march election eighteen fifty five.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed

Female to male

Source Target Source transcript
There was no man sir.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed
Source Target Source transcript
The word was often on miss taylor’s lips and she recognized it.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed

Male to male

Source Target Source transcript
He knew well enough that mother’s cap was the best cap in all the world.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed
Source Target Source transcript
A recent memory has usually more context than a more distant one.
VC models target 2 sec target 3 sec target 6 sec target 9 sec
kNN-VC
PH
Proposed