Content-Aware Style Augmentation for Zero-shot Voice Conversion with Short Target Speech
Authors
Hyeonjin Cha, Seyun Um, Miseul Kim, Changhwan Kim, Seungshin Lee and Hong-Goo Kang
Proposed Method

In this letter, we propose a neural zero-shot voice conversion (ZS-VC) system that achieves high speaker similarity and high speech intelligibility at the same time by incorporating a content-aware style generation module. Although recent neural ZS-VC systems perform strongly on either speaker similarity or speech intelligibility, attaining both remains challenging, especially when only a short target speech sample is available. We attribute this limitation to the insufficient content problem, in which the linguistic content of the target speech fails to fully cover that of the source speech. To address this issue, we introduce a method that augments the target speaker’s style features for underrepresented content using self-supervised feature generation. Experimental results demonstrate that the proposed system, when integrated with the feature matching-based approach kNN-VC, outperforms existing methods in both speaker similarity and speech intelligibility.
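To make the insufficient content problem concrete, the following is a minimal sketch of kNN-style feature matching, in which each source frame is replaced by the average of its nearest target frames under cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: random vectors stand in for self-supervised speech features, `k=4` is an assumed setting, and `knn_match`, `coverage`, and `random_frames` are hypothetical helper names. The proposed style augmentation module itself is not shown.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_match(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k most similar target frames."""
    converted = []
    for frame in source_feats:
        ranked = sorted(target_feats, key=lambda t: cosine(frame, t), reverse=True)
        nearest = ranked[:k]
        converted.append(
            [sum(t[d] for t in nearest) / len(nearest) for d in range(len(frame))]
        )
    return converted


def coverage(source_feats, target_feats):
    """Mean best-match similarity: a rough proxy for how well the target covers the source."""
    best = [max(cosine(s, t) for t in target_feats) for s in source_feats]
    return sum(best) / len(best)


def random_frames(n, dim, rng):
    """Placeholder features (assumption): real systems use self-supervised speech features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
source = random_frames(50, 8, rng)
short_target = random_frames(10, 8, rng)                   # few frames, like a short target clip
long_target = short_target + random_frames(190, 8, rng)    # same clip plus more target speech

converted = knn_match(source, short_target)
cov_short = coverage(source, short_target)
cov_long = coverage(source, long_target)
print(f"coverage with short target: {cov_short:.3f}")
print(f"coverage with long target:  {cov_long:.3f}")
```

Because the longer pool contains the shorter one, its coverage score can never be lower. With only a few target frames, many source frames have no close match, which is the gap that augmenting the target's style features is meant to fill.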
Samples
Baselines: kNN-VC, Phoneme Hallucinator (PH)
Female to female
| Source | Target | Source transcript |
|---|---|---|
| | | What a sweet voiced little innocent it was to be sure. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
| Source | Target | Source transcript |
|---|---|---|
| | | But this subject will be more properly discussed when we treat of the different races of mankind. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
Male to female
| Source | Target | Source transcript |
|---|---|---|
| | | He turned away from her suddenly and set off across the strand. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
| Source | Target | Source transcript |
|---|---|---|
| | | This was at the march election eighteen fifty five. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
Female to male
| Source | Target | Source transcript |
|---|---|---|
| | | There was no man sir. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
| Source | Target | Source transcript |
|---|---|---|
| | | The word was often on miss taylor’s lips and she recognized it. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
Male to male
| Source | Target | Source transcript |
|---|---|---|
| | | He knew well enough that mother’s cap was the best cap in all the world. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |
| Source | Target | Source transcript |
|---|---|---|
| | | A recent memory has usually more context than a more distant one. |
| VC models | target 2 sec | target 3 sec | target 6 sec | target 9 sec |
|---|---|---|---|---|
| kNN-VC | | | | |
| PH | | | | |
| Proposed | | | | |