
More information about the pretrains' dataset. #1987

Open
codename0og opened this issue Apr 20, 2024 · 1 comment
Labels
documentation 📄文档说明 help wanted 🚸请求协助 question 💬信息不足

Comments

@codename0og

@RVC-Boss We already know that the VCTK corpus (108-speaker version) was used for the dataset, but what about the processing?
Was anything applied?

  • denoising
  • peak / RMS normalization
  • compression

And how was the dynamic range handled?

I am asking because, despite having trained tons of models, I still can't draw any firm conclusions on this from my own trainings.

A) Is it better to limit the dynamic range of the dataset as much as possible (without introducing distortion, of course)?
B) Keep it somewhat natural (slight peak taming plus light compression to even things out, then a general -2 or -3 dB normalization)?
C) Tame the harsher peaks (or peaks in general) but leave the dynamic range alone?

Which approach do you think would suit your pretrains?
I would really benefit from this information, and I am sure other advanced users would too.
Thank you in advance!
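Since the options above hinge on how peak and RMS normalization behave differently, here is a minimal NumPy sketch of the two (my own illustration, not anything from the RVC codebase; the function names and target levels are made up):

```python
import numpy as np

def peak_normalize(x: np.ndarray, target_db: float = -3.0) -> np.ndarray:
    """Scale so the absolute sample peak sits at target_db dBFS."""
    peak = np.max(np.abs(x))
    if peak == 0:
        return x
    return x * (10 ** (target_db / 20) / peak)

def rms_normalize(x: np.ndarray, target_db: float = -20.0) -> np.ndarray:
    """Scale so the RMS level sits at target_db dBFS.
    Note: unlike peak normalization, this can push peaks past 0 dBFS."""
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0:
        return x
    return x * (10 ** (target_db / 20) / rms)

# toy signal: one second of a 440 Hz sine at 0.5 peak, 16 kHz sample rate
t = np.linspace(0, 1, 16000, endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 440 * t)

y = peak_normalize(x, -3.0)
print(round(float(np.max(np.abs(y))), 3))  # 0.708, i.e. -3 dBFS
```

Peak normalization only guarantees headroom; RMS normalization equalizes perceived loudness across clips, which is usually what matters for training data consistency.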

@fumiama fumiama added documentation 📄文档说明 question 💬信息不足 help wanted 🚸请求协助 labels Apr 23, 2024
@SCRFilms

@RVC-Boss We already know that the VCTK corpus (108-speaker version) was used for the dataset, but what about the processing? Was anything applied?

I just checked some samples in the VCTK dataset and it's really bad:

- tons of mouth clicks
- loud mic noise
- low-frequency rumbling noise (could be a DC offset issue)
- lacks breath sounds
- lacks pitch variation (the speakers' pitch just sits around 110 Hz to 200 Hz)
- lacks higher harmonic detail (which causes flipping-harmonic and static-harmonic artifacting)

I don't think they applied any processing to the audio at all; the dataset is bad in the first place.
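On the DC offset point: a constant offset shows up directly as a nonzero signal mean, so it's easy to check. A minimal sketch (my own, not from any RVC tooling; for slow rumble rather than a pure constant offset, a high-pass filter around 20 Hz would be the more robust fix):

```python
import numpy as np

def remove_dc_offset(x: np.ndarray) -> np.ndarray:
    """Remove a constant DC offset by subtracting the signal mean."""
    return x - np.mean(x)

# toy example: a 220 Hz sine riding on a +0.1 DC offset
t = np.linspace(0, 1, 16000, endpoint=False)
x = 0.3 * np.sin(2 * np.pi * 220 * t) + 0.1

print(round(float(np.mean(x)), 3))                         # 0.1 -> offset present
print(round(abs(float(np.mean(remove_dc_offset(x)))), 3))  # 0.0 -> offset removed
```

Mean subtraction only handles a truly constant offset over the clip; time-varying low-frequency rumble needs the filtering approach instead.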
