How to get the same mel feature in "metadata.pkl"? #84

Open
gnipping opened this issue Mar 30, 2021 · 19 comments
@gnipping

I used your default parameters and code to compute the mel feature of "p225_001.wav" from the VCTK corpus. However, the mel feature I get has shape (385, 80), not the (90, 80) stored in "metadata.pkl". Do you have extra processing steps?

@auspicious3000
Owner

No I didn't.

@gnipping
Author

Then why is the first dimension not the same? Also, when I use my (385, 80) mel feature together with your model and the speaker embeddings in "metadata.pkl" to generate the audio "p225xp228", I only get about 6 seconds of strange voice; I cannot hear the words "please call Stella". So how did you reduce the first dimension from 385 to 90?

@auspicious3000
Owner

The first dim is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.
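For reference, a quick sanity check of the frame-count arithmetic (a sketch; the durations below are back-calculated from the frame counts reported in this thread, not measured):

# With a centered STFT, the number of frames is roughly n_samples / hop_length.
hop_length = 256
print(385 * hop_length / 48000)  # ~2.05 s: 385 frames correspond to reading the file at 48 kHz
print(385 * hop_length / 16000)  # ~6.2 s: why feeding 385 frames to the vocoder yields ~6 s of audio
print(90 * hop_length / 16000)   # ~1.44 s: what ~90 frames correspond to at the expected 16 kHz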

@gnipping
Author

gnipping commented Mar 31, 2021

I used your code and the parameters from issue #4 to generate the mel feature; the hop size is 256 and the resulting shape is (385, 80). The code is below. If there is a bug, please point it out, thanks!

import os
import numpy as np
from math import ceil
import soundfile as sf
from scipy import signal
from scipy.signal import get_window
from librosa.filters import mel

def butter_highpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a

def pySTFT(x, fft_length=1024, hop_length=256):
    x = np.pad(x, int(fft_length // 2), mode='reflect')
    noverlap = fft_length - hop_length
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // hop_length, fft_length)
    strides = x.strides[:-1] + (hop_length * x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T
    return np.abs(result)

mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

dirName = '../dataset/VCTK-Corpus/wav48'
subdir = 'p225'
fileName = 'p225_001.wav'
x, fs = sf.read(os.path.join(dirName, subdir, fileName))
y = signal.filtfilt(b, a, x)
wav = y
D = pySTFT(wav).T
D_mel = np.dot(D, mel_basis)
D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
S = np.clip((D_db + 100) / 100, 0, 1)

print(S.shape)

@auspicious3000
Owner

The sampling rate should be 16k instead of 48k
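For anyone hitting the same issue, a minimal sketch of one possible fix while keeping soundfile for I/O, reusing the variables from the snippet above (librosa.resample is shown with keyword arguments, which recent librosa versions require):

import librosa

x, fs = sf.read(os.path.join(dirName, subdir, fileName))  # VCTK wav48 files are 48 kHz
if fs != 16000:
    # Downsample to 16 kHz before the high-pass filter and STFT.
    x = librosa.resample(x, orig_sr=fs, target_sr=16000)
    fs = 16000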

@gnipping
Author

Thank you!

@gnipping gnipping reopened this Mar 31, 2021
@gnipping
Author

I have another question: I replaced soundfile with the following code to read the data

x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)

However, the final shape is (129, 80), still not (90, 80).

@hongchengzhu

Hello, I met the same problem. I'd like to generate my own "metadata.pkl" to convert voices using the small training example provided by the author (e.g. "\wavs\p225\p225_003.wav"), so I tried to use "make_spect.py" to generate the mel-spectrogram myself. However, my result is (376, 80), not (90, 80).
I noticed that you asked the author the same question and were told to change the sampling rate from 48k to 16k. However, your code already uses 16k as the sampling rate, not 48k, which confuses me. How did you solve this issue?
Thank you very much!

@gnipping
Author

gnipping commented Apr 9, 2021


This is because I use the VCTK corpus downloaded from https://datashare.ed.ac.uk/handle/10283/2950. There I could not find audio at 16 kHz, so I use the 48 kHz audio and read it with that code.

@hongchengzhu


Do you mean that you used the 48 kHz VCTK dataset, then downsampled it to 16 kHz, and that is how you got the (90, 80) mel-spectrogram to convert? That's very strange: the author's own small training samples (in "\wav\p225\p225_003.wav") are already at 16 kHz, yet they give me (376, 80).

Also, have you ever tried converting voices with your own "metadata.pkl"? If so, could you please give me some advice? I'm new to voice conversion, so I don't know much about how to build the model. Thank you again!

@gnipping
Author

gnipping commented Apr 9, 2021


I do not get the shape (90, 80); I get (129, 80) instead. The only change is replacing "x, fs = sf.read(os.path.join(dirName, subdir, fileName))" with "x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)". You will get my result if you use the same dataset downloaded from the link I gave.

@antovespoli3

I get shape (129, 80) as well. Any update on this?

@auspicious3000
Owner

The length does not have to be 90. As long as the sampling frequency is correct, it should be fine.
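In case it helps, here is a minimal sketch of padding an arbitrary-length mel-spectrogram to a multiple of 32 frames before conversion, in the spirit of the pad_seq helper in the repo's conversion notebook (the base of 32 is an assumption here; check the notebook for the value actually used):

import numpy as np
from math import ceil

def pad_seq(x, base=32):
    # Pad the frame axis up to the next multiple of `base` so the model
    # can handle utterances of any length.
    len_out = int(base * ceil(x.shape[0] / base))
    len_pad = len_out - x.shape[0]
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad

uttr, len_pad = pad_seq(np.zeros((129, 80)))
print(uttr.shape, len_pad)  # (160, 80) 31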

@antovespoli3

antovespoli3 commented Aug 10, 2021

Many thanks for your prompt reply. Unfortunately, I noticed that the audio quality is not as good. Is there any chance you used a particular procedure for downsampling to 16kHz? Or maybe you performed some preprocessing while downsampling?

Thanks

@auspicious3000
Owner

No, and the procedure used for downsampling should not make a big difference.

@antovespoli3

antovespoli3 commented Aug 10, 2021

The reason why I thought about some additional preprocessing is that by analysing the spectrograms I noticed some differences between the original dataset and your version.

Below is the spectrogram that I computed starting from the original dataset, downsampling to 16kHz, and applying make_spect.py (shape 119*80)
[spectrogram image]

Below is the spectrogram for p225_001 that you included in metadata.pkl (shape 90*80)
[spectrogram image]

Below is the spectrogram that I computed starting from the file that you host on the demo page (https://auspicious3000.github.io/autovc-demo/audios/ground_truth1/p225_001.wav), downsampling to 16kHz (originally at 22050Hz), and applying make_spect.py (shape 90*80)
[spectrogram image]

I don't understand why your files produce almost identical spectrograms, while if we start from the original dataset we get significantly different results.

The audio quality is affected as well:

"p225xp225 (8).wav" is the audio generated by the original dataset
"p225xp225 (7).wav" is the audio generated by the metadata.pkl in this repository

audio_files.zip

Do you have any idea of what could be the difference between your files and the files in the original dataset?

@antovespoli3

I finally found that the difference is the trimming at the head and tail of the audio. I reproduced an almost identical file by "trimming it by hand", but I couldn't find the exact silence trimming procedure that you used.
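For anyone else reproducing this, a minimal sketch of automatic silence trimming with librosa.effects.trim as an approximation of the hand trimming (top_db=25 is a guessed threshold that may need tuning per utterance; it is not the value behind metadata.pkl):

import librosa

y, sr = librosa.load('p225_001.wav', sr=16000)
y_trimmed, _ = librosa.effects.trim(y, top_db=25)
print(len(y) / sr, len(y_trimmed) / sr)  # duration before / after trimming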

@auspicious3000
Owner

OK. That explains it. I trimmed the silence off by hand.

@MHVali

MHVali commented Nov 9, 2022


@auspicious3000 Do you mean you trimmed the silence off the whole VCTK dataset by hand to generate your training data and train the model?
