Discl-VC: Disentangled Discrete Representation and In-Context Learning for Controllable Zero-Shot Voice Conversion

0. Contents

  1. Abstract
  2. Demos -- zero-shot voice conversion with timbre transformation and prosody preservation
  3. Demos -- zero-shot voice conversion with timbre and prosody transformation


1. Abstract

Current zero-shot voice conversion systems can synthesize the voice of unseen speakers. However, most existing approaches struggle to accurately replicate the speaking style of the source speaker or to mimic the distinctive speaking style of the target speaker, which limits the controllability of voice conversion. In this work, we propose Discl-VC, a novel voice conversion framework that disentangles content and prosody information from self-supervised speech representations and synthesizes the target speaker's voice through in-context learning with a flow matching transformer. To achieve fine-grained control over the prosody of the generated speech, we introduce a masked generative transformer that predicts discrete prosody tokens in a non-autoregressive manner based on prompts. Experimental results demonstrate the superior performance of Discl-VC in zero-shot voice conversion and its remarkable accuracy in prosody control for synthesized speech.


Figure 1: The overall architecture of our proposed system.
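
As a rough illustration of what predicting discrete prosody tokens "in a non-autoregressive manner" can look like, the sketch below runs a generic iterative mask-predict loop of the kind used by masked generative transformers. It is not the Discl-VC implementation: the cosine unmasking schedule, the `MASK` id, the vocabulary size, and the `toy_predictor` stand-in for the transformer are all assumptions made for this example.

```python
import math
import random

MASK = -1  # hypothetical id for a position that has not been predicted yet


def toy_predictor(tokens, prompt, vocab_size, rng):
    """Stand-in for the transformer (hypothetical).

    Returns one (token, confidence) pair per position. A real model would
    condition on content features and the prosody prompt; this toy version
    just repeats the prompt pattern and draws a random confidence.
    """
    return [(prompt[i % len(prompt)] % vocab_size, rng.random())
            for i in range(len(tokens))]


def iterative_decode(length, prompt, vocab_size=8, steps=4, seed=0):
    """Fill `length` masked positions in `steps` parallel refinement passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        preds = toy_predictor(tokens, prompt, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        n_keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = len(masked) - n_keep_masked
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_decode(length=10, prompt=[3, 1, 2]))
```

All positions are predicted in parallel on every pass, and only the most confident ones are committed, so the sequence is completed in a fixed number of passes rather than one token at a time.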

Our system is trained in two stages:

2. Demos -- zero-shot voice conversion with timbre transformation and prosody preservation

*Note: We additionally provide generated samples of DDDM-VC under the same conditions.

Part 1. Source speech

Timbre prompt | DDDM-VC* | FAcodec | Vevo | Discl-VC

Part 2. Source speech

Timbre prompt | DDDM-VC* | FAcodec | Vevo | Discl-VC

Part 3. Source speech

Timbre prompt | DDDM-VC* | FAcodec | Vevo | Discl-VC

Part 4. Source speech

Timbre prompt | DDDM-VC* | FAcodec | Vevo | Discl-VC

3. Demos -- zero-shot voice conversion with timbre and prosody transformation

Part 1. Source speech

Timbre and Prosody prompt | Vevo | Discl-VC

Part 2. Source speech

Timbre and Prosody prompt | Vevo | Discl-VC

Part 3. Source speech

Timbre and Prosody prompt | Vevo | Discl-VC

Part 4. Source speech

Timbre prompt | Prosody prompt | Vevo | Discl-VC