FA2 - P40 || Mixtral partial GPU offload Gibberish #7400
Comments
Running it with
Is the commit you linked specifically the one that introduced the bug, or is it simply the one you tested?
It's the commit I tested with. Commit 0fc1e82 shows the same behavior.
I'm building with If I build with MMQ=0, the gibberish behavior doesn't happen with Mixtral. However: So it's something specific to MoE + FA + MMQ + partial offload that's causing the gibberish output after a second reply, or when the context is close to full. Additional Info:
I have LLAMA_CUDA_FORCE_MMQ ON for my 1070 in #7401 because it is significantly faster on the 1070, but OFF for both 4070s.
@askmyteapot Is it fixed? Do you know which PR fixed it?
Sorry, I meant to close only the KoboldCPP one. I'm testing now.
OK, I can no longer replicate this issue.
Copied from LostRuins#854 but with additional testing for llama.cpp specifically
Discovered a bug with the following conditions:
Commit: 1ea2a00
OS: Win 11
CUDA: 12.4
CPU: Ryzen 5800X
RAM: 64GB DDR4
GPU0: RTX 3060 Ti [not being used for koboldcpp]
GPU1: Tesla P40
Model: Any Mixtral (tested a L2-8x7b-iq4 and a L3-4x8b-q6k mixtral)
GPU offload: Partial (28/33 layers)
Max Context: 8192
Flash Attention: True
What happens?
Load a long-context chat in SillyTavern that is longer than the max context, OR
Start a new chat: the first response is normal, the second response is gibberish
Sometimes crashes with the below error
Outputs:
enyprocess startup Tamb轻 access minutes ==>MBER)). enemscribpeedIntentelyindices обе modifynextabor중unt Long cousin Javaа feasUnityEngine Clark loader CharlotteAllowthing Ameraut luego境 Sout capture submarom helyenasjarinterpretibility press Leop Susan estim '% fistправ son dating tonight allocated PomController)$ября forceife Adm레 hoping logged heroRunaju.]widget reduces wattechn traders Nik Domingenerator ability assigned Hey AV Properties deputuvud Jacques
Works:
Everything without Flash attention enabled
Full GPU offload (could only test the L3 mixtral for this)
Non-mixtral Full offload
Non-mixtral Partial offload.
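For anyone trying to narrow this down, the combinations reported in this thread can be written out as a small truth table. This is only a sketch of what has been observed here, not llama.cpp code: `RunConfig` and `reported_gibberish` are hypothetical names made up for illustration, and the MMQ condition comes from the follow-up comments rather than from the original report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one test run (not a llama.cpp type)."""
    moe_model: bool        # Mixtral-style mixture-of-experts model
    flash_attention: bool  # Flash Attention enabled
    force_mmq: bool        # built with LLAMA_CUDA_FORCE_MMQ on
    partial_offload: bool  # only some layers on the GPU (e.g. 28/33)


def reported_gibberish(cfg: RunConfig) -> bool:
    """True only for the combination reported as broken in this thread."""
    return (
        cfg.moe_model
        and cfg.flash_attention
        and cfg.force_mmq
        and cfg.partial_offload
    )


# The failing case from the report: Mixtral + FA + MMQ + partial offload.
broken = RunConfig(moe_model=True, flash_attention=True,
                   force_mmq=True, partial_offload=True)
# The same setup with Flash Attention disabled was reported to work.
no_fa = RunConfig(moe_model=True, flash_attention=False,
                  force_mmq=True, partial_offload=True)

print(reported_gibberish(broken), reported_gibberish(no_fa))
```

Flipping any one of the four flags gives a configuration that was reported as working, which is why the problem looks like an interaction between all four rather than a fault in any single feature.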