@RVC-Boss We already know that the VCTK corpus (108-speaker version) was used as the dataset, but what about the processing?
Was anything applied?
- denoising
- peak / RMS normalization
- compression
And what about the dynamic range?
I'm asking because, even though I've trained tons of models, I still haven't been able to draw any useful conclusions about this from my own training runs:
A) Is it better to limit the dynamic range of the dataset as much as possible (without introducing distortion, of course)?
B) Keep it somewhat natural (slight peak taming plus slight compression to even things out, then overall normalization to -2 or -3 dB)?
C) Take care of the harsher peaks (or peaks in general) but leave the dynamic range alone?
Which approach do you think would be suitable for your pretrains?
I'd really benefit from this information, and I'm pretty sure some other more advanced users would too.
Thank you in advance!
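For anyone who wants to compare the options above on their own data, here is a minimal sketch of how peak level, RMS level, and crest factor (a rough proxy for dynamic range) can be measured, and how peak normalization to a target such as -3 dB can be applied. It assumes float audio scaled to [-1.0, 1.0]; the function names and the synthetic sine test signal are hypothetical stand-ins, not anything taken from the RVC codebase or its pretrain pipeline.

```python
import numpy as np

EPS = 1e-12  # avoids log10(0) on silent clips


def peak_dbfs(x):
    """Peak level in dB relative to full scale (1.0)."""
    return 20.0 * np.log10(max(np.max(np.abs(x)), EPS))


def rms_dbfs(x):
    """RMS level in dB relative to full scale (1.0)."""
    return 20.0 * np.log10(max(np.sqrt(np.mean(np.square(x))), EPS))


def crest_factor_db(x):
    """Peak-to-RMS ratio in dB; lower values mean a more compressed signal."""
    return peak_dbfs(x) - rms_dbfs(x)


def normalize_peak(x, target_dbfs=-3.0):
    """Apply one constant gain so the loudest sample sits at target_dbfs."""
    gain = 10.0 ** ((target_dbfs - peak_dbfs(x)) / 20.0)
    return x * gain


# Synthetic stand-in for a real clip: a 220 Hz sine at half scale.
sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 220.0 * t)

normalized = normalize_peak(signal, target_dbfs=-3.0)
print(f"peak before: {peak_dbfs(signal):.2f} dBFS, after: {peak_dbfs(normalized):.2f} dBFS")
print(f"crest factor before: {crest_factor_db(signal):.2f} dB, after: {crest_factor_db(normalized):.2f} dB")
```

Peak normalization is a constant gain, so it changes the level but not the crest factor. Compression or limiting (options A and B) would lower the crest factor, which makes it a convenient number to log when comparing differently processed versions of the same dataset.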
> @RVC-Boss We already know that for the dataset, the vctk corpus ( 108 speaker version ) was used but how about the processing? Was there anything applied?
I just checked some samples from the VCTK dataset and the quality is really bad:
- tons of mouth clicks
- loud mic noise
- low-frequency rumble (could be a DC offset issue)
- lacks breath sounds
- lacks pitch variation (the speakers' pitch just sits at about 110 Hz to 200 Hz)
- lacks higher harmonic detail (which causes flipping-harmonic and static-harmonic artifacts)
I don't think they applied any processing to the audio at all; the dataset is bad in the first place.
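One way to check whether the low-frequency problem is a DC offset or actual rumble is to measure both separately. The sketch below is an assumption-laden illustration: the 80 Hz cutoff, the function names, and the synthetic test clip are all hypothetical choices, and it says nothing about how the VCTK recordings were actually made or processed.

```python
import numpy as np


def dc_offset(x):
    """Mean sample value; a value clearly away from 0 indicates a DC offset."""
    return float(np.mean(x))


def remove_dc(x):
    """Subtract the mean so the waveform is centered on zero."""
    return x - np.mean(x)


def low_freq_energy_ratio(x, sr, cutoff_hz=80.0):
    """Fraction of spectral energy below cutoff_hz (assumed rumble region)."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    total = spectrum.sum()
    if total == 0.0:
        return 0.0
    return float(spectrum[freqs < cutoff_hz].sum() / total)


# Synthetic stand-in: a 150 Hz "voice" tone, 30 Hz rumble, and a DC offset.
sr = 16000
t = np.arange(sr) / sr
clip = (0.3 * np.sin(2 * np.pi * 150.0 * t)
        + 0.1 * np.sin(2 * np.pi * 30.0 * t)
        + 0.05)

centered = remove_dc(clip)
print(f"DC offset before: {dc_offset(clip):.4f}, after: {dc_offset(centered):.4f}")
print(f"energy below 80 Hz before: {low_freq_energy_ratio(clip, sr):.3f}, "
      f"after DC removal: {low_freq_energy_ratio(centered, sr):.3f}")
```

If the low-frequency ratio stays high after the mean is removed, the problem is rumble rather than DC offset, and a high-pass filter would be the more appropriate fix.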