feat: add support for flash_attn #4120
Conversation
How about adding a […]? Also, a quick note: you've forgotten to update the […].
Yeah, good idea, added. N.b. I feel like there's quite a bit of repeated parameter definition throughout the server/client code that could really be pulled in from a single source of truth, but that's a problem for another day.
Whoops! Thanks, fixed now.
👍 Hopefully this does get merged and not just left to die a painful "death-by-conflicts" like so many other PRs have already! I'm getting double the context on some of the bigger models using flash attention.
@bsdnet any chance of an approval?
@sammcj API binding is missing in […]:

```go
if opts.FlashAttn {
	params = append(params, "-fa")
}
```

Additionally, I'm not sure if it's the right time to introduce the […].
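For context, a minimal sketch of what that missing binding might look like on the API side, mirroring how other options are declared with JSON tags. The field name and tag here are my assumptions, not the PR's actual diff:

```go
package api

// Hypothetical sketch of the missing binding: exposing the setting
// on api.Options so it can travel from client requests through to
// the runner. Field name and JSON tag are assumptions, not the
// PR's actual diff.
type Options struct {
	FlashAttn bool `json:"flash_attn,omitempty"`
}
```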
(force-pushed from 3b249b4 to ffb2e2a)
Forgot to git add server.go after making an update, thanks @wanderingmeow, fixed now. I hear you re: the server params; what I've done is remove the additional params, default to enabling flash_attn if CUDA or Metal is detected, and add an option to explicitly disable it.
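Roughly, the default-on shape looks like this. This is a sketch of the behaviour described above, not the committed code, and the helper name and parameters are assumptions:

```go
package server

// flashAttnParams is a hypothetical helper illustrating the
// default-on behaviour described above: flash attention is enabled
// when the detected backend is CUDA or Metal, unless the user has
// explicitly disabled it.
func flashAttnParams(library string, userDisabled bool) []string {
	if userDisabled {
		return nil
	}
	if library == "cuda" || library == "metal" {
		return []string{"--flash-attn"}
	}
	return nil
}
```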
Just wanted to point out that pre-Turing NVIDIA cards (CC < 7.0) don't support flash attention, as mentioned in ggerganov/llama.cpp/issues/7055. To avoid any issues or confusion, I think we should add a check before enabling it to ensure it's supported by the user's hardware:

```go
if opts.FlashAttn {
	flashAttnSupported := (gpus[0].Library == "cuda" && gpus[0].Major >= 7) || gpus[0].Library == "metal"
	if flashAttnSupported {
		params = append(params, "--flash-attn")
	} else {
		slog.Warn("flash attention is not supported on your current hardware configuration, it is now disabled")
	}
}
```

Considering this feature is opt-in and with this check in place, […]
That's a nice way of handling it! Thanks, I'm learning every day 😄.
Working really well! 👍
Is there a disadvantage? If not, shouldn't it be enabled by default for everyone except systems that can't support it?
While I agree, I feel like the most important thing right now is to get someone to approve it (as an optional feature) so folks can actually start using it and reporting their findings. Ollama maintainers / @bsdnet or maybe @jmorganca: if there's anything I can do to help get this PR moving along, please do let me know.
Looks like it needs approval to allow the workflow to run.
(force-pushed from 05407fd to 4bbd583)
@dhiltgen any chance of your eyes on this one?
Cleaned up logic and rebased.
I'm all ears if anyone among the Ollama contributors thinks the PR needs improvements, so please just let me know, @jmorganca or @dhiltgen. This will resolve #4051. I'm somewhat afraid this PR will sit there if I don't keep nagging. I'm sure folks are just very busy, but given the number of open PRs I suspect the project would benefit from embracing some additional automation around the review and merge process for features/fixes, especially if it continues to maintain a partial fork of llama.cpp's server.
@jmorganca great, thanks! Let me know if you want me to update or remove the bump of the submodule, otherwise I'm guessing updating from main once your other PR is merged should do it.
Daily reminder about this PR @jmorganca 🤣
Needs one more rebase
@jmorganca done :)
On second look I'm seeing some strange results on Metal with partial offloading. I'm not sure if this might be fixed in a more recent commit of llama.cpp, but it spits out strange characters:

```
% ./ollama run llama3
>>> /set parameter num_gpu 10
Set parameter 'num_gpu' to '10'
>>> hi
.8;!177'1G.DC3"6C,64;B!H/G-<392^C
```

We might need to only enable this on fully loaded models.

(Also: thanks so much for rebasing this quite a few times)
@jmorganca I hadn't noticed this! But to be safe, I just pushed an update to ensure it's disabled if we're using the CPU runner.
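Something along these lines, presumably, extended to partial offloads as suggested above. This is my sketch of the guard, and the parameter names are assumptions rather than the committed change:

```go
package server

// maybeFlashAttn sketches the guard discussed above: only pass
// --flash-attn when every layer is offloaded to a supported GPU
// backend, so partial offloads and the CPU runner never see it.
// The parameter names are assumptions, not the committed change.
func maybeFlashAttn(library string, layersOffloaded, totalLayers int) []string {
	fullyOffloaded := totalLayers > 0 && layersOffloaded >= totalLayers
	supported := library == "cuda" || library == "metal"
	if supported && fullyOffloaded {
		return []string{"--flash-attn"}
	}
	return nil
}
```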
I know this is late for this pull request, but can you add a new env var so people can force the use of flash_attn?
@dpublic I think to avoid dragging this PR on I'd rather just leave it as is (auto-enabled). I did originally have this, but the preference was to simply enable it if supported until Ollama implements a proper config file / dotenv combo (which would be great 🤞), as there are a lot of settings for both Ollama and llama.cpp that would benefit from a centralised configuration.
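For what it's worth, an env-var escape hatch could be as small as the following sketch. The variable name OLLAMA_FLASH_ATTENTION is hypothetical here, not something this PR adds:

```go
package server

import (
	"os"
	"strconv"
)

// flashAttnOverride sketches the env-var escape hatch being asked
// for. The variable name OLLAMA_FLASH_ATTENTION is hypothetical.
// It returns the override value and whether one was actually set.
func flashAttnOverride() (value, ok bool) {
	raw := os.Getenv("OLLAMA_FLASH_ATTENTION")
	if raw == "" {
		return false, false
	}
	b, err := strconv.ParseBool(raw)
	if err != nil {
		return false, false
	}
	return b, true
}
```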
ping @jmorganca
Flash attention is only enabled by default when a supported CUDA version or Metal is detected, and it is configurable via params and the API.
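To illustrate the API side, a per-request toggle would look roughly like the sketch below. The option key "flash_attn" is an assumption derived from the Go field name in this PR, not a confirmed part of the merged API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Sketch of toggling the setting per request over the HTTP API.
// The option key "flash_attn" is an assumption derived from the Go
// field name in this PR, not a confirmed part of the merged API.
func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3",
		"prompt": "hi",
		"options": map[string]any{
			"flash_attn": false, // explicitly opt out for this request
		},
	})
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```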
Credit to @wanderingmeow who took my broken idea and made it work 🎉
Fixes #4051