How to get the same mel feature in "metadata.pkl"? #84

Open
gnipping opened this issue Mar 30, 2021 · 19 comments
@gnipping

I used your default parameters and code to compute the mel feature of "p225_001.wav" from the VCTK corpus. However, the mel feature I get has shape (385, 80), not the (90, 80) stored in "metadata.pkl". Do you have extra processing steps?

@auspicious3000
Owner

No I didn't.

@gnipping
Author

Then why is the first dimension not the same? Also, when I use my (385, 80) mel feature together with your model and the speaker embeddings in "metadata.pkl" to generate the audio "p225xp228", I only get about 6 seconds of strange voice; I cannot hear the words "please call Stella". So how did you reduce the first dimension from 385 to 90?

@auspicious3000
Owner

The first dim is the number of frames. There is no dimension reduction. It should be around 90 for that utterance. Please double-check your code.
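For reference, a quick sanity check of the frame-count arithmetic (a sketch; the durations below are back-calculated from the frame counts reported in this thread, not measured):

# With a centered STFT, the number of frames is roughly n_samples / hop_length.
hop_length = 256
print(385 * hop_length / 48000)  # ~2.05 s: 385 frames correspond to reading the file at 48 kHz
print(385 * hop_length / 16000)  # ~6.2 s: why feeding 385 frames to the vocoder yields ~6 s of audio
print(90 * hop_length / 16000)   # ~1.44 s: what ~90 frames correspond to at the expected 16 kHz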

@gnipping
Author

gnipping commented Mar 31, 2021

I used your code and the parameters from issue #4 to generate the mel feature; the hop size is 256 and the resulting shape is (385, 80). The code is below. If there is a bug, please point it out, thanks!

import os
import numpy as np
from math import ceil
import soundfile as sf
from scipy import signal
from scipy.signal import get_window
from librosa.filters import mel

def butter_highpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a

def pySTFT(x, fft_length=1024, hop_length=256):
    x = np.pad(x, int(fft_length // 2), mode='reflect')
    noverlap = fft_length - hop_length
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // hop_length, fft_length)
    strides = x.strides[:-1] + (hop_length * x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T
    return np.abs(result)

mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

dirName = '../dataset/VCTK-Corpus/wav48'
subdir = 'p225'
fileName = 'p225_001.wav'
x, fs = sf.read(os.path.join(dirName, subdir, fileName))
y = signal.filtfilt(b, a, x)
wav = y
D = pySTFT(wav).T
D_mel = np.dot(D, mel_basis)
D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
S = np.clip((D_db + 100) / 100, 0, 1)

print(S.shape)

@auspicious3000
Owner

The sampling rate should be 16k instead of 48k
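For anyone hitting the same issue, a minimal sketch of one possible fix while keeping soundfile for I/O, reusing the variables from the snippet above (librosa.resample is shown with keyword arguments, which recent librosa versions require):

import librosa

x, fs = sf.read(os.path.join(dirName, subdir, fileName))  # VCTK wav48 files are 48 kHz
if fs != 16000:
    # Downsample to 16 kHz before the high-pass filter and STFT.
    x = librosa.resample(x, orig_sr=fs, target_sr=16000)
    fs = 16000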

@gnipping
Author

Thank you!

@gnipping gnipping reopened this Mar 31, 2021
@gnipping
Author

I have another question: I replaced soundfile with the following code to read the data

x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)

However, the final shape is (129, 80), still not (90, 80).

@hongchengzhu

Hello, I met the same problem. I'd like to generate my own "metadata.pkl" to convert voices using the small training example provided by the author (e.g. "\wavs\p225\p225_003.wav"), so I tried to use "make_spect.py" to generate the mel-spectrogram myself. However, my result is (376, 80), not (90, 80).
I noticed that you asked the author the same question and were told to change the sampling rate from 48k to 16k. However, your code already uses 16k as the sampling rate, not 48k, which confuses me. How did you solve this issue?
Thank you very much!

@gnipping
Author

gnipping commented Apr 9, 2021


This is because I use the VCTK corpus downloaded from https://datashare.ed.ac.uk/handle/10283/2950. There I could not find audio at 16 kHz, so I use the 48 kHz audio and read it with that code.

@hongchengzhu


Do you mean that you used the 48 kHz VCTK dataset, then downsampled it to 16 kHz, and that is how you got the (90, 80) mel-spectrogram to convert? That's very strange: the author's own small training samples (in "\wav\p225\p225_003.wav") are already at 16 kHz, yet they give me (376, 80).

Also, have you ever tried converting voices with your own "metadata.pkl"? If so, could you please give me some advice? I'm new to voice conversion, so I don't know much about how to build the model. Thank you again!

@gnipping
Author

gnipping commented Apr 9, 2021


I do not get the shape (90, 80); I get (129, 80) instead. The only change is replacing "x, fs = sf.read(os.path.join(dirName, subdir, fileName))" with "x, fs = librosa.load(os.path.join(dirName, subdir, fileName), sr=16000)". You will get my result if you use the same dataset downloaded from the link I gave.

@antovespoli3

I get shape (129, 80) as well. Any update on this?

@auspicious3000
Owner

The length does not have to be 90. As long as the sampling frequency is correct, it should be fine.
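In case it helps, here is a minimal sketch of padding an arbitrary-length mel-spectrogram to a multiple of 32 frames before conversion, in the spirit of the pad_seq helper in the repo's conversion notebook (the base of 32 is an assumption here; check the notebook for the value actually used):

import numpy as np
from math import ceil

def pad_seq(x, base=32):
    # Pad the frame axis up to the next multiple of `base` so the model
    # can handle utterances of any length.
    len_out = int(base * ceil(x.shape[0] / base))
    len_pad = len_out - x.shape[0]
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad

uttr, len_pad = pad_seq(np.zeros((129, 80)))
print(uttr.shape, len_pad)  # (160, 80) 31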

@antovespoli3

antovespoli3 commented Aug 10, 2021

Many thanks for your prompt reply. Unfortunately, I noticed that the audio quality is not as good. Is there any chance you used a particular procedure for downsampling to 16kHz? Or maybe you performed some preprocessing while downsampling?

Thanks

@auspicious3000
Owner

No, and the procedure used for downsampling should not make a big difference.

@antovespoli3

antovespoli3 commented Aug 10, 2021

The reason why I thought about some additional preprocessing is that by analysing the spectrograms I noticed some differences between the original dataset and your version.

Below is the spectrogram that I computed starting from the original dataset, downsampling to 16kHz, and applying make_spect.py (shape 119*80)
[spectrogram image]

Below is the spectrogram for p225_001 that you included in metadata.pkl (shape 90*80)
[spectrogram image]

Below is the spectrogram that I computed starting from the file that you host on the demo page (https://auspicious3000.github.io/autovc-demo/audios/ground_truth1/p225_001.wav), downsampling to 16kHz (originally at 22050Hz), and applying make_spect.py (shape 90*80)
[spectrogram image]

I don't understand why your files produce almost identical spectrograms, while if we start from the original dataset we get significantly different results.

The audio quality is affected as well:

"p225xp225 (8).wav" is the audio generated by the original dataset
"p225xp225 (7).wav" is the audio generated by the metadata.pkl in this repository

audio_files.zip

Do you have any idea of what could be the difference between your files and the files in the original dataset?

@antovespoli3

I finally found that the difference is the trimming at the head and tail of the audio. I reproduced an almost identical file by "trimming it by hand", but I couldn't find the exact silence trimming procedure that you used.
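For anyone else reproducing this, a minimal sketch of automatic silence trimming with librosa.effects.trim as an approximation of the hand trimming (top_db=25 is a guessed threshold that may need tuning per utterance; it is not the value behind metadata.pkl):

import librosa

y, sr = librosa.load('p225_001.wav', sr=16000)
y_trimmed, _ = librosa.effects.trim(y, top_db=25)
print(len(y) / sr, len(y_trimmed) / sr)  # duration before / after trimming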

@auspicious3000
Owner

OK. That explains it. I trimmed the silence off by hand.

@MHVali

MHVali commented Nov 9, 2022


@auspicious3000 Do you mean you trimmed the silence off the whole VCTK dataset by hand to generate your training data and train the model?
