
How to train in ONE-HOT pattern? #68

Open
Lukelluke opened this issue Dec 10, 2020 · 1 comment

Comments

@Lukelluke

Hi @auspicious3000,

I went through all the files and issues but couldn't find any description of 「how to train in one-hot pattern」, the mode you suggest using if we don't need 「one-shot」 performance.

Could anyone who has successfully trained AutoVC in one-hot mode, rather than with embeddings from the pretrained speaker encoder, share how they did it?

Hope to get a useful reply from you all!

All the best,
Luke Huang

@ruclion

ruclion commented Dec 23, 2020

Hi,
I haven't trained the one-hot version myself, but I have some ideas to share~
The only difference between the one-hot and speaker-encoder versions is whether the speaker embedding itself can be trained by the AutoVC training process.
Training in the one-hot pattern could look like this (a minimal sketch follows the list):

  1. Get the total number of speakers, say 40.
  2. Set up a lookup embedding table, as in multi-speaker Tacotron 2.
  3. For every training sentence, the inputs are: mels for the content encoder, and, instead of a speaker-encoder embedding, the speaker id, which indexes the lookup embedding table to produce a trainable embedding vector; concatenate this vector with the content vector.
  4. When the gradient flows back, the speaker's embedding vector changes a little.
  5. Throughout the whole training process, the same speaker keeps the same embedding vector, just like a word embedding.
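
Not the author's code, just a minimal PyTorch sketch of steps 2–5. The 40 speakers and the 256-dimensional embedding are assumptions (256 matches `dim_emb` in the repo), and `generator` is a hypothetical stand-in for AutoVC's content encoder + decoder:

```python
import torch
import torch.nn as nn

class OneHotSpeakerEmbedding(nn.Module):
    """Trainable lookup table that replaces the pretrained speaker encoder."""
    def __init__(self, n_speakers=40, dim_emb=256):
        super().__init__()
        # One row per speaker; rows are updated by backprop like word embeddings.
        self.table = nn.Embedding(n_speakers, dim_emb)

    def forward(self, speaker_ids):
        # speaker_ids: LongTensor of shape (batch,)
        return self.table(speaker_ids)  # -> (batch, dim_emb)

spk_emb = OneHotSpeakerEmbedding(n_speakers=40, dim_emb=256)
speaker_ids = torch.tensor([3, 17])      # ids of the utterances in this batch
emb = spk_emb(speaker_ids)               # trainable speaker vectors, (2, 256)
# mel_out = generator(mel_in, emb, emb)  # hypothetical: same call site where the
#                                        # encoder embedding was used before
# loss.backward() then also updates spk_emb.table.weight via the gradient
```

Because the table rows receive gradients, each speaker's vector drifts toward whatever the decoder needs, which is exactly the "can change by gradient" advantage described below.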

In fact, 「how to train in one-hot pattern」 in the author's mind is probably just the simplest way to train the model for a multi-speaker problem. It can be better than the speaker-encoder version because its embeddings can change by gradient, while the speaker encoder's embeddings cannot.
