Enable Flash Attention on GGML/GGUF (feature now merged into llama.cpp) #4051
Are there any cons to using Flash Attention? If not, it probably should be the default.
Also, if there is more memory available we should really set the default quants to Q4_K_M and fall back to Q4 when Q4_K_M is not available.
Yeah, I think this would help a lot.
Just been looking at the source to see if this can be done: https://github.com/ollama/ollama/blob/main/llm/ext_server/server.cpp and even though it's importing the latest llama.cpp, the server code itself has fallen behind. Some of the default values (e.g. --repeat-penalty) and even meanings (e.g. --batch) are also behind because of this. I guess we could try to make a diff to see if this can be updated, but since PRs never seem to get accepted and random stuff constantly seems to change regarding the passing of command-line options, I'm reluctant to even try...
I had a hot crack at a PR tonight but, as you said @jukofyork, it seems server.cpp has diverged from llama.cpp a lot and I couldn't get my head around it. Might have to wait for someone more familiar with Ollama's version of it / more brains than me.
FYI: LM Studio added Flash Attention earlier today: https://www.reddit.com/r/LocalLLaMA/comments/1cir98j/lm_studio_released_new_version_with_flash/
Comparing GGUF performance with/without Flash Attention. Columns: Hardware, Model, Request, LM Studio (with Flash Attention), Ollama (without Flash Attention).
That looks VERY similar to what I tried @wanderingmeow, perhaps I just wasn't parsing the parameters correctly 🤔. That's awesome!
That worked instantly, still nowhere near as fast as LM Studio but it's a start.
@wanderingmeow I've created a PR with the changes - #4120
Awesome! 👍
I wonder if we can extend this to use a regex to match against the model name that gets passed in. Then we can just completely bypass the modelfile code and pass model-specific parameters as well? This would save all the hassle of PRs not getting accepted and then breaking because the code all got moved around! It looks like we could just copy that in. It probably doesn't want to be split using "," though, as that is already used elsewhere.
Something like:
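(a rough sketch in Go of how that regex-to-extra-args mapping could look; the env var format, the ";" separator, and the function name are made up purely for illustration, not anything Ollama actually has)

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extraArgsFor parses a hypothetical spec of the form "pattern=args;pattern=args"
// and returns the extra llama.cpp arguments for the first pattern that matches
// the model name. Entries are separated by ";" rather than "," on purpose.
func extraArgsFor(modelName, spec string) []string {
	for _, entry := range strings.Split(spec, ";") {
		pattern, args, ok := strings.Cut(entry, "=")
		if !ok {
			continue
		}
		if matched, _ := regexp.MatchString(pattern, modelName); matched {
			return strings.Fields(args)
		}
	}
	return nil
}

func main() {
	// e.g. a hypothetical OLLAMA_MODEL_EXTRA_ARGS="llama3.*=-fa;command-r.*=-fa"
	spec := "llama3.*=-fa;command-r.*=-fa"
	fmt.Println(extraArgsFor("llama3:70b-instruct", spec)) // prints [-fa]
}
```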
Although I'm not sure if that would actually work.
From my testing, the current ollama
@jukofyork I had a look at this, but it was getting messy, fast. What I think probably could work is:
That's way out of my ballpark though 😅
It really helps reduce the VRAM use of long-context models: some I could only run at 16k or 32k are now running at 32k or 64k for the same quant!
Yeah, but it would be good for the long run: some of the parameters and their default settings that Ollama is using are very out of date compared to llama.cpp's server now, and I can only see things getting worse and worse if nothing is done about it.
Anyway, #4120 provides the functionality for now; just waiting on someone to approve it...
#4120 is still sitting waiting for approval to be merged; I've been trying to keep it up to date and fix conflicts etc. as other PRs are merged in.
Flash Attention has landed in llama.cpp (ggerganov/llama.cpp#5021).
The TL;DR is simply to pass the -fa flag to llama.cpp's server.
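On the Ollama side, that would essentially mean appending the flag when the llama.cpp server process is spawned. A minimal Go sketch of the idea, assuming hypothetical names for the binary path and the gating option (this is not Ollama's actual code):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Hypothetical stand-ins for values Ollama already computes elsewhere.
	serverBinary := "./server" // llama.cpp example server binary
	modelPath := "model.gguf"
	flashAttention := true // would presumably be gated behind an option or env var

	args := []string{"--model", modelPath, "--ctx-size", "8192"}
	if flashAttention {
		// -fa (alias --flash-attn) enables Flash Attention in llama.cpp's server.
		args = append(args, "-fa")
	}

	cmd := exec.Command(serverBinary, args...)
	fmt.Println(cmd.String())
}
```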
Also a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that hints like this can be easily enabled? E.g.:
OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else
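A sketch of how the server launch could consume that (the env var name and the comma separator are just this proposal, not an existing Ollama feature):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Base arguments Ollama would already be building for llama.cpp's server.
	args := []string{"--model", "model.gguf"}
	// Hypothetical: append any user-supplied flags verbatim.
	if extra := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"); extra != "" {
		args = append(args, strings.Split(extra, ",")...)
	}
	fmt.Println(args)
}
```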