voice-changer/tutorials/tutorial_rvc_en_latest.md
2023-07-04 10:20:54 +09:00

Realtime Voice Changer Client for RVC Tutorial (v.1.5.3.7)

Japanese/日本語

Introduction

This application is client software for real-time voice conversion. It supports various voice conversion models, including RVC, MMVCv13, MMVCv15, and So-vits-svcv40. This document uses RVC (Retrieval-based-Voice-Conversion) as the tutorial material, but the basic operations are essentially the same for every model.

In the rest of this document, the original Retrieval-based-Voice-Conversion-WebUI is referred to as original-RVC, and the RVC-WebUI created by ddPn08 is referred to as ddPn08-RVC.

Notes

Steps up to startup

Start GUI

Windows version

Unzip the downloaded zip file and run start_http.bat.

Mac version

After extracting the downloaded file, run startHttp.command. If a message says that the developer cannot be verified, hold the Control key and click the file to run it again (or right-click it and run it).

Precautions when connecting remotely

When connecting remotely, use the .bat file (Windows) or .command file (Mac) whose name has https in place of http.

Console

When you run the .bat file (Windows) or .command file (Mac), a screen like the following is displayed, and on the first start-up various data is downloaded from the Internet. Depending on your environment, this usually takes 1-2 minutes.

image

GUI

Once the download of the required data is complete, a dialog like the one below will be displayed. If you wish, press the yellow icon to reward the developer with a cup of coffee. Pressing the Start button will make the dialog disappear.

image

GUI Overview

Use this screen to operate the voice changer.

image

Quick start

You can immediately perform voice conversion using the data downloaded at startup.

Operation

(1) To get started, click on the Model Selection area to select the model you would like to use. Once the model is loaded, the images of the characters will be displayed on the screen.

(2) Select the microphone (input) and speaker (output) you wish to use. If you are unsure, we recommend selecting client and then choosing your microphone and speaker. (The difference between client and server is explained later.)

(3) When you press the start button, the audio conversion will start after a few seconds of data loading. Try saying something into the microphone. You should be able to hear the converted audio from the speaker.

image

FAQ on Quick Start

Q1. The audio is becoming choppy and stuttering.

A1. Your PC's performance may not be adequate. Try increasing the CHUNK value (marked A in the figure), for example to 1024. Also try setting F0 Det to dio (marked B in the figure).

image

Q2. The voice is not being converted.

A2. Refer to this to identify where the problem lies, then consider a solution.

Q3. The pitch is off.

A3. This was not covered in the Quick Start, but if the model supports pitch change, you can adjust the pitch with TUNE. See the more detailed explanation below.

Configurable items

Title

image

Icons are links.

| Icon | Link to |
| --- | --- |
| Octocat | GitHub repository |
| question | manual |
| spanner | tools |
| coffee | donation |

clear setting

Resets the configuration to its initial state.

Model Selection

image

Select the model you wish to use.

By pressing the "edit" button, you can edit the list of models (model slots). Please refer to the model slots editing screen for more details.

Main Control

image

The loaded character image is displayed on the left side. The status of the real-time voice changer is overlaid on the top left of the character image.

You can use the buttons and sliders on the right side to control various settings.

status of real-time voice changer

The lag from speaking to hearing the converted voice is roughly buf + res. When adjusting, keep the buffer time (buf) longer than the conversion time (res).
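The relationship above can be sketched in a few lines. This is only an illustration: the function names and the sample readings are hypothetical, and both values are assumed to be in the same unit, as read from the status overlay. The likely reason for keeping buf longer than res is that a new chunk arrives every buf, so conversion that takes longer than that cannot keep up.

```python
def total_latency(buf, res):
    """Approximate lag from speaking to hearing the result (same unit as inputs)."""
    return buf + res


def buffer_is_sufficient(buf, res):
    """True when the buffer time is longer than the conversion time."""
    return buf > res


# Hypothetical readings from the status overlay.
buf, res = 256, 180
print(total_latency(buf, res))         # 436
print(buffer_is_sufficient(buf, res))  # True
```

If the second check is false, try increasing CHUNK or decreasing EXTRA.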

vol

This is the volume after voice conversion.

buf

The length of each chunk in milliseconds when capturing audio. Shortening the CHUNK will decrease this number.

res

The measured time it takes to convert the data of length CHUNK plus EXTRA. Decreasing either CHUNK or EXTRA reduces this number.

Control

start/stop button

Press "start" to begin voice conversion and "stop" to end it.

GAIN

  • in: Change the volume of the audio input to the model.

  • out: Change the volume of the converted audio.

TUNE

Enter how much to shift the pitch of the voice. The value can also be changed during inference. Below are some guidelines for settings.

  • +12 for male voice to female voice conversion
  • -12 for female voice to male voice conversion
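To see why ±12 is a natural guideline, assume that TUNE is expressed in semitones (this tutorial does not state the unit, but 12 semitones make one octave). Under that assumption, the frequency ratio of a shift can be sketched as follows; `pitch_ratio` is a hypothetical helper, not part of the application.

```python
def pitch_ratio(semitones):
    """Frequency multiplier for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


print(pitch_ratio(12))   # 2.0  (one octave up doubles the frequency)
print(pitch_ratio(-12))  # 0.5  (one octave down halves it)

# A hypothetical 120 Hz voice shifted by +12 ends up at 240 Hz.
print(120.0 * pitch_ratio(12))  # 240.0
```

Values between -12 and +12 give intermediate shifts, which can help when the source and target voices are closer in pitch.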

INDEX (Only for RVC)

Specifies how much weight is given to the features used in training. This is only valid for models that have an index file registered. A value of 0 uses HuBERT's output as-is, and 1 assigns all the weight to the original features. If the index ratio is greater than 0, the index search may add processing time.
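One way to picture the index ratio is as a linear blend between two feature vectors. The sketch below assumes such a blend; the function name, the plain-list representation, and the sample numbers are hypothetical and are not taken from the application's code.

```python
def blend_features(hubert_feats, retrieved_feats, index_ratio):
    """Mix HuBERT output with features retrieved from the index.

    index_ratio = 0 returns the HuBERT output unchanged;
    index_ratio = 1 returns only the retrieved (training) features.
    """
    if not 0.0 <= index_ratio <= 1.0:
        raise ValueError("index_ratio must be between 0 and 1")
    return [
        index_ratio * r + (1.0 - index_ratio) * h
        for h, r in zip(hubert_feats, retrieved_feats)
    ]


hubert = [0.0, 2.0, 4.0]      # hypothetical HuBERT output
retrieved = [1.0, 1.0, 1.0]   # hypothetical features found via the index

print(blend_features(hubert, retrieved, 0.0))  # [0.0, 2.0, 4.0]
print(blend_features(hubert, retrieved, 1.0))  # [1.0, 1.0, 1.0]
print(blend_features(hubert, retrieved, 0.5))  # [0.5, 1.5, 2.5]
```

With a ratio of 0 no retrieved features are needed, which is consistent with the note that a ratio above 0 can take longer.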

Voice

Select the target speaker for the voice conversion.

save setting

Saves the current settings. When the model is loaded again, the saved settings are applied (some settings are not saved).

export to onnx

Converts the loaded PyTorch model to ONNX and outputs it. This is only available when the loaded model is an RVC PyTorch model.

Others

The items that can be configured vary depending on the AI model in use. Check the model provider's website for its features and other information.

Configuration

image

Here you can review the operational settings and the conversion process.

NOISE

You can switch the noise cancellation features on and off. They are only available in client device mode.

  • Echo: echo cancellation
  • Sup1, Sup2: noise suppression

F0 Det (F0 Estimator)

Choose an algorithm for extracting the pitch. You can choose from the following options.

  • Lightweight dio
  • High-precision harvest
  • GPU-enabled crepe

S. Thresh (Noise Gate)

This is the volume threshold for performing voice conversion. When the RMS of the input is below this value, conversion is skipped and silence is returned instead. (Because the conversion process is skipped, the processing load stays small in this case.)
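The gating behavior described above can be sketched as follows. This is a minimal illustration, not the application's implementation: `rms`, `gate`, and the placeholder `convert` callback are hypothetical names, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def rms(samples):
    """Root mean square of a chunk of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gate(samples, threshold, convert):
    """Return silence for quiet chunks; otherwise run the conversion."""
    if rms(samples) < threshold:
        # Conversion is skipped entirely, so this path is cheap.
        return [0.0] * len(samples)
    return convert(samples)


def fake_convert(chunk):
    """Placeholder standing in for the real voice conversion."""
    return [-s for s in chunk]


quiet = [0.001, -0.001, 0.001, -0.001]
loud = [0.5, -0.5, 0.5, -0.5]

print(gate(quiet, 0.01, fake_convert))  # [0.0, 0.0, 0.0, 0.0]
print(gate(loud, 0.01, fake_convert))   # [-0.5, 0.5, -0.5, 0.5]
```

Raising the threshold silences more background noise but risks cutting off soft speech.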

CHUNK (Input Chunk Num)

Determines how much audio is cut out and converted in one pass. A higher value makes the conversion more efficient, but it also increases buf, which lengthens the maximum wait before conversion starts. The approximate time is displayed as buf.
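The link between CHUNK and the displayed buf time can be estimated with simple arithmetic. The sketch below is only an illustration: this tutorial does not state how many samples one CHUNK unit represents or which sample rate is used, so the values 128 and 48000 in the usage lines are assumptions, and `chunk_duration_ms` is a hypothetical helper.

```python
def chunk_duration_ms(chunk, samples_per_unit, sample_rate_hz):
    """Length of one captured chunk in milliseconds."""
    return chunk * samples_per_unit * 1000.0 / sample_rate_hz


# Assumed values (not from this tutorial): 128 samples per CHUNK unit at 48 kHz.
print(chunk_duration_ms(192, 128, 48000))             # 512.0
print(round(chunk_duration_ms(1024, 128, 48000), 1))  # 2730.7
```

Whatever the real constants are, the relationship is linear: doubling CHUNK doubles the time spent waiting for each chunk to fill.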

EXTRA (Extra Data Length)

Determines how much past audio is included in the input when converting. More past audio improves the accuracy of the conversion, but it also increases res because the calculation takes longer. (Probably because the Transformer is the bottleneck, the calculation time grows with the square of this length.)
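If the quadratic growth suggested above holds, the effect of changing the length can be estimated with a ratio. This is a rough sketch under that assumption only; `relative_cost` is a hypothetical helper and the lengths are arbitrary example values.

```python
def relative_cost(old_length, new_length):
    """Estimated change in calculation time, assuming cost ~ length squared."""
    if old_length <= 0 or new_length <= 0:
        raise ValueError("lengths must be positive")
    return (new_length / old_length) ** 2


print(relative_cost(4096, 8192))  # 4.0   (doubling the length: about 4x the time)
print(relative_cost(4096, 2048))  # 0.25  (halving the length: about a quarter)
```

In practice, watch the res value in the status overlay after each change rather than relying on the estimate.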

Details are here

GPU

You can select the GPU to use in the onnxgpu version.

In the onnxdirectML version, you can switch the GPU ON/OFF.

AUDIO

Choose the type of audio device you want to use. For more information, please refer to the document.

  • Client: You can make use of the microphone input and speaker output with the GUI functions such as noise cancellation.
  • Server: VCClient can directly control the microphone and speaker to minimize latency.

input

You can select a sound input device such as a microphone input. It's also possible to input from audio files (size limit applies).

output

You can select an audio output device such as a speaker.

monitor

Select the audio output device used for monitoring, such as a speaker. This is only available in server device mode.

Please refer to this document for an overview of the idea.

REC.

Outputs the converted audio to a file.

ServerIO Analyzer

You can record and check the audio input to the voice conversion AI and the audio output from it.

Please refer to this document for an overview of the idea.

SIO rec.

Starts/stops recording of both the audio input to the voice conversion AI and the audio output from it.

output

Select the device on which the recorded audio is played back.

in

Plays the audio that was input to the voice conversion AI.

out

Plays the audio that was output from the voice conversion AI.

more...

You can do more advanced operations.

Merge Lab

You can merge models.

Advanced Setting

You can configure more advanced settings.

Server Info

You can check the configuration of the current server.