
adding batches to training? #43

Open
Many0therFunctions opened this issue Oct 25, 2023 · 0 comments
Many0therFunctions commented Oct 25, 2023

Am I correct in saying that the training code in customtokenizer only trains on one X/Y pair at a time instead of a whole batch at once?

Are there any plans to add batching to the training code so it can process a whole batch at once? As it stands, when combining the English, German, Polish, Japanese, and Portuguese datasets from Hugging Face, training takes about 1 hour per epoch and only uses about 3 of 8 GB of VRAM.
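Something like this is what I have in mind, as a rough sketch only: it assumes the quantizer is a plain PyTorch nn.Module mapping HuBERT feature sequences (T, feat_dim) to semantic-token logits (T, vocab_size), and the Dataset class, file layout, and hyperparameters below are illustrative assumptions, not the repo's actual API.

```python
import numpy as np
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

PAD_ID = -100  # ignored by nn.CrossEntropyLoss by default


class PairDataset(Dataset):
    """Yields (HuBERT features, semantic tokens) pairs stored as matching .npy files."""

    def __init__(self, pairs):  # pairs: list of (features_path, tokens_path)
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        feat_path, tok_path = self.pairs[idx]
        feats = torch.from_numpy(np.load(feat_path)).float()  # (T, feat_dim)
        toks = torch.from_numpy(np.load(tok_path)).long()     # (T,) -- assumes frame-aligned with feats
        return feats, toks


def collate(batch):
    """Pad variable-length clips to the longest clip in the batch."""
    feats, toks = zip(*batch)
    feats = pad_sequence(feats, batch_first=True)                      # (B, T, feat_dim)
    toks = pad_sequence(toks, batch_first=True, padding_value=PAD_ID)  # (B, T)
    return feats, toks


def train_epoch(model, pairs, batch_size=64, device="cuda"):
    loader = DataLoader(PairDataset(pairs), batch_size=batch_size,
                        shuffle=True, collate_fn=collate, num_workers=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # padded positions don't contribute
    model.train().to(device)
    for feats, toks in loader:
        feats, toks = feats.to(device), toks.to(device)
        logits = model(feats)                         # (B, T, vocab_size)
        loss = loss_fn(logits.transpose(1, 2), toks)  # CrossEntropyLoss expects (B, vocab, T)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

With batch_size=64 that would turn tens of thousands of per-pair steps into a few hundred batched steps per epoch, and it would actually use the idle VRAM.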

(Obviously this doesn't run on Google Colab, since trying to load 32,000+ files crashes Google Drive AND runs out of instance RAM on Colab, but if it could, batching would be very nice: one epoch could be done in maybe a hundred steps instead of tens of thousands. Right now it seems to fit one set of data in a way that conflicts with another set: it learns one feature and gets worse at the other, then corrects the other and gets worse at the first. If it were all batched, I think it would "see the bigger picture" and treat all the patterns as related parts of a whole, as the toy example below tries to show.)
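As a toy, self-contained illustration of that "bigger picture" point (not tied to this repo's code): averaging the loss over a batch gives the same update direction as averaging the per-sample gradients, so conflicting examples pull on the weights at the same time instead of one after another.

```python
import torch

w = torch.zeros(2, requires_grad=True)
xs = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])]
ys = [torch.tensor(1.0), torch.tensor(-1.0)]

# Per-sample gradients, averaged afterwards (one-pair-at-a-time training applies these in turn).
per_sample = []
for x, y in zip(xs, ys):
    loss = ((w * x).sum() - y) ** 2
    (g,) = torch.autograd.grad(loss, w)
    per_sample.append(g)
avg_grad = torch.stack(per_sample).mean(dim=0)

# One batched loss over both samples at once.
x_b, y_b = torch.stack(xs), torch.stack(ys)
batch_loss = (((w * x_b).sum(dim=1) - y_b) ** 2).mean()
(g_batch,) = torch.autograd.grad(batch_loss, w)

assert torch.allclose(avg_grad, g_batch)  # same direction, applied in a single step
```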

(But I dunno, maybe this architecture just isn't enough for being omni-lingual and can at best only learn languages within a family like traditional linguists define them: Romance languages, Indo-European languages, and so on. I say that because just last night I tried using the English 23-epoch model as a pretraining starting point, and 8 hours later, at 8 epochs, it can sort of map an unsupported language like Vietnamese. It approximates a lot of words at the wrong "notes", but it did better than expected in that regard, so the theory isn't too far off. Where it broke down is that some speakers' English words, said with an accent, suddenly turned into Russian or Polish phonemes, which really makes me wonder if there's a limit to how "different" the languages can be. Still, that's off topic here; I think batched training would really help with all of this.)

(But the important thing here is that bark DOES seem to have the ability to generate the correct phonemes for novel sounds if you can just tease out the right semantic tokens, and you can get really close by having the quantizer hybridize the languages your target language is closest to... but that's such a pain to do for every language.)
