
How to change speaker encoder to one-hot encoder #102

Open
Jwaminju opened this issue Nov 9, 2021 · 6 comments


Jwaminju commented Nov 9, 2021

Hi, I'm interested in this project, and I'm looking forward to running it with my Korean audio files.
But I'm an undergraduate student with little knowledge of audio processing programming.

I've read a lot of issues in this repo, but I was still confused, so I opened this issue.
The zero-shot model demo produced results, but I want to run AutoVC-One-Hot to compare.
I think I have to change the make_metadata.py file to use a one-hot encoder.
I tried to change the speaker encoder to one-hot using tf.one_hot, but the printed shape of the emb variable ([1, 128, 80, 256]) did not match the result of C(melsp) (which was [1, 256]).
I used the same data as the demo wav files.


Could you help me how to code the one-hot encodings? Thank you.


yenebeb commented Nov 30, 2021

Hi @Jwaminju,

Not sure if you still need it, but this might be helpful for anyone looking to do the same.

The emb variable is indeed the right one to change. The embeddings currently used are created using the GE2E loss.
To change this to a one-hot embedding you can simply create a zero-filled array and give each speaker its own index. This can be done in the make_metadata.py file.
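As an aside, the [1, 128, 80, 256] shape you got suggests tf.one_hot was applied across the whole mel tensor rather than to a scalar speaker index. A minimal sketch of the difference, assuming TensorFlow 2.x (the numpy version below avoids TensorFlow entirely):

import tensorflow as tf

# tf.one_hot appends a new depth axis to whatever it is given, so applying it
# to an (integer) tensor of shape [1, 128, 80] yields [1, 128, 80, 256].
# Applying it to a single scalar index gives the expected vector instead:
speaker_index = 3                           # hypothetical speaker id
emb = tf.one_hot(speaker_index, depth=256)  # shape (256,)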

I haven't tested it (and wrote this quite quickly), but something like this should work.
Replace:

for speaker in sorted(subdirList):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)
    _, _, fileList = next(os.walk(os.path.join(dirName,speaker)))
    # make speaker embedding
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    embs = []

with:

# use enumerate to get the speaker index
for i, speaker in enumerate(sorted(subdirList)):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)

    # -----
    # one-hot embedding
    # create a zero array of shape (256,); this shape is right because
    # squeeze() effectively turns the GE2E output of shape (1, 256) into (256,)
    emb = np.zeros(256, dtype=np.float32)
    # set the speaker id
    emb[i] = 1
    utterances.append(emb)

The whole second for loop can be removed here, since we don't need the mel spectrogram to create the embeddings anymore.

If you have more than 256 speakers, or you want the embedding size to match the number of speakers you have, you'll have to pass the --dim_emb parameter to main.
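A minimal sketch of sizing the one-hot dimension to the speaker count (reusing subdirList from make_metadata.py; remember to pass the same value as --dim_emb when training):

import numpy as np

speakers = sorted(subdirList)   # subdirList as in make_metadata.py
n_speakers = len(speakers)

# row i of the identity matrix is speaker i's one-hot embedding
one_hot = np.eye(n_speakers, dtype=np.float32)

for i, speaker in enumerate(speakers):
    utterances = [speaker, one_hot[i]]  # then append file paths as before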

@WGQ123-code

Hi @yenebeb,
It's a pleasure to read your comments. I need to use the speaker embedding too. For a particular speaker, we already know the position of the '1', so if I use a one-hot embedding, is training still necessary?
I don't know if my understanding is correct. If not, please give me some guidance.
Thanks.


WildFire212 commented Dec 6, 2021

@yenebeb Thanks a lot for the comment!
I went through most of the issues in the repo, and this is the only one that gives some explanation of the one-hot encoder.
I am still a bit confused.
By removing the second loop, would we remove the mel-spectrograms entirely?
It would be really helpful if you could explain this or point to a resource about it.


yenebeb commented Dec 16, 2021

@WGQ123-code short answer: yes, it's important to train with the one-hot embedding.

Somewhat longer answer:
You replace the whole embedding with one-hot embeddings.
The 'only' difference between training on one-hot embeddings and on the embeddings generated by the GE2E encoder is their 'accuracy'. GE2E tries to create the same embedding for the same speaker, which means the embedding carries some information about the speaker. By training on the GE2E embedding, the model learns to recognise this information and is thus also able to work on unseen data (zero-shot learning). With one-hot embeddings you remove that information and force the model to train on the mel. However, the model still has to know which voices (mels) are from the same speaker, and that is why you need the one-hot embedding during training time.

@WildFire212
Yes, you do remove the mel-spectrograms, but if you look carefully you'll notice that they're only used to create the GE2E embedding. Since you want one-hot embeddings, this isn't needed, and removing it will speed up the process quite a bit.
The mels are actually created and saved when you run make_spect.
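Putting the pieces together, the whole speaker loop in make_metadata.py could look roughly like this (an untested sketch; the file-list part and the ./spmel and train.pkl paths are assumed to match the original script):

import os
import pickle
import numpy as np

rootDir = './spmel'  # mel-spectrograms written by make_spect
dirName, subdirList, _ = next(os.walk(rootDir))

speakers = []
for i, speaker in enumerate(sorted(subdirList)):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)

    # one-hot speaker embedding instead of the GE2E d-vector
    emb = np.zeros(256, dtype=np.float32)
    emb[i] = 1
    utterances.append(emb)

    # file list, unchanged from the original script
    _, _, fileList = next(os.walk(os.path.join(dirName, speaker)))
    for fileName in sorted(fileList):
        utterances.append(os.path.join(speaker, fileName))

    speakers.append(utterances)

with open(os.path.join(rootDir, 'train.pkl'), 'wb') as handle:
    pickle.dump(speakers, handle)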

@WildFire212

@yenebeb Thank you for the clarification!

@WGQ123-code

@yenebeb Thank you very much for your guidance! Wish you a happy life!
