
feat: add support for flash_attn #4120

Open · sammcj wants to merge 8 commits into main

Conversation

@sammcj commented May 3, 2024

Flash attention is only enabled by default when a supported CUDA version or Metal is detected; it is configurable via parameters and the API.

Credit to @wanderingmeow who took my broken idea and made it work 🎉

Fixes #4051
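For anyone wanting to try this from the API once it lands, here is a minimal Go sketch of toggling the option per request. The flash_attn key in the options map follows the parameter name discussed in this PR and should be treated as an assumption until the final API settles.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Standard /api/generate request body, with the (assumed) flash_attn
	// option from this PR passed through the per-request options map.
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3",
		"prompt": "hi",
		"stream": false,
		"options": map[string]any{
			"flash_attn": true, // option name taken from this PR; may change
		},
	})

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out["response"])
}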

@sammcj requested a review from bsdnet on May 3, 2024
@wanderingmeow

How about adding a flash_attn flag to the Runner struct, so it can be set via /set parameter flash_attn true or an API call? I'm not sure if this would be an API-breaking change, but I think it's worth considering. Currently, flash attention doesn't work for CPU or pre-Tensor-core GPU inference, so it might not be desirable to enable it by default.
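For context, a minimal sketch of what that flag could look like on the Runner options struct in api/types.go; the neighbouring fields and the JSON tag shown here are illustrative, not the exact final definition.

package api

// Runner holds options that control how the model runner process is started.
type Runner struct {
	NumGPU    int  `json:"num_gpu,omitempty"`
	NumThread int  `json:"num_thread,omitempty"`
	UseMMap   bool `json:"use_mmap,omitempty"`

	// FlashAttn toggles llama.cpp's flash attention kernel; it would be
	// settable via `/set parameter flash_attn true` or the request options.
	FlashAttn bool `json:"flash_attn,omitempty"`
}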

Also, a quick note: you've forgotten to update the llama.cpp dependency.

@sammcj commented May 4, 2024

> How about adding a flash_attn flag to the Runner struct, so it can be set via /set parameter flash_attn true or an API call? I'm not sure if this would be an API-breaking change, but I think it's worth considering. Currently, flash attention doesn't work for CPU or pre-Tensor-core GPU inference, so it might not be desirable to enable it by default.

Yeah, good idea, added.

n.b. I feel like there are quite a few repeated parameter definitions throughout the server/client definitions that really could be pulled in from a single source of truth, but that's a problem for another day.

> Also, a quick note: you've forgotten to update the llama.cpp dependency.

Whoops! Thanks, fixed now.

@jukofyork commented May 4, 2024

👍

Hopefully this does get merged and not just left to die a painful "death-by-conflicts" like so many other PRs have already! ☹️

I'm getting double the context on some of the bigger models using llama.cpp server directly!

@sammcj commented May 4, 2024

@bsdnet any chance of an approval?

@wanderingmeow

@sammcj The API binding is missing in server.go. Specifically, the -fa flag is not being passed to server.cpp when opts.FlashAttn is set:

if opts.FlashAttn {
    params = append(params, "-fa")
}

Additionally, I'm not sure if it's the right time to introduce the OLLAMA_LLAMA_EXTRA_ARGS environment variable. One concern is that it doesn't handle parameter overrides, and some flags don't have a disable equivalent (e.g., -fa enables flash attention but there's no --no-flash-attn to disable it). Maybe we should wait until after the revamp of server.cpp, especially server_params_parse(), or consider calling C++ directly in Go without spawning another process?

@sammcj force-pushed the main branch 3 times, most recently from 3b249b4 to ffb2e2a on May 5, 2024
@sammcj commented May 5, 2024

Forgot to git add server.go after making an update. Thanks @wanderingmeow, fixed now.

I hear you re: the server params. What I've done is remove the additional params, default to enabling flash_attn if CUDA or Metal is detected, and add an option to explicitly disable it.

@sammcj changed the title from "feat: option to parse args to llama.cpp, support flash_attn" to "feat: add support for flash_attn" on May 5, 2024
@wanderingmeow

Just wanted to point out that pre-Turing NVIDIA cards (CC < 7.0) don't support flash attention, as mentioned in ggerganov/llama.cpp/issues/7055.

To avoid any issues or confusion, I think we should add a check before enabling it to ensure it's supported by the user's hardware:

if opts.FlashAttn {
    flashAttnSupported := (gpus[0].Library == "cuda" && gpus[0].Major >= 7 || gpus[0].Library == "metal")
    if flashAttnSupported {
        params = append(params, "--flash-attn")
    } else {
        slog.Warn("flash attention is not supported on your current hardware configuration, it is now disabled")
    }
}

Considering this feature is opt-in, and with this check in place, the --disable-flash-attention flag in server.cpp can be removed.

@sammcj commented May 5, 2024

That's a nice way of handling it! Thanks, I'm learning every day 😄.
PR updated.

@jukofyork

Working really well! 👍

@chigkim commented May 6, 2024

Is there a disadvantage? If not, should it be enabled by default for everyone other than systems that can't support it?

@sammcj commented May 6, 2024

> Is there a disadvantage? If not, should it be enabled by default for everyone other than systems that can't support it?

While I agree, I feel like the most important thing for this right now is to get someone to approve it (as an optional feature) so folks can actually start using it and reporting their findings.

Ollama maintainers / @bsdnet or maybe @jmorganca: if there's anything I can do to help get this PR moving along, please do let me know.

@sammcj commented May 7, 2024

Looks like it needs approval to allow the workflow to run.

@sammcj force-pushed the main branch 2 times, most recently from 05407fd to 4bbd583 on May 8, 2024
@sammcj commented May 8, 2024

@dhiltgen any chance of your eyes on this one?

(Review comments on api/types.go and llm/server.go were marked resolved.)
@sammcj commented May 9, 2024

Cleaned up logic and rebased.

@sammcj commented May 11, 2024

I'm all ears if any of the Ollama contributors think the PR needs improvements; please just let me know, @jmorganca or @dhiltgen.

This will resolve #4051.

I'm somewhat afraid this PR will sit there if I don't keep nagging. I'm sure folks are just very busy, but given the number of open PRs, I suspect the project would benefit from embracing some additional automation around the review and merge process for features/fixes, especially if it continues to maintain a partial fork of llama.cpp's server.


@jmorganca (Member)

Hi @sammcj, this is close to merging. I'm just going to update the llama.cpp submodule in another PR, since we can remove a temporary patch (#4414).

@sammcj commented May 14, 2024

@jmorganca great, thanks!

Let me know if you want me to update or remove the bump of the submodule; otherwise, I'm guessing updating from main once your other PR is merged should do it.

@sammcj commented May 15, 2024

Daily reminder about this PR @jmorganca 🤣

@jmorganca (Member) left a comment


Needs one more rebase

@sammcj commented May 16, 2024

@jmorganca done :)

@jmorganca (Member) left a comment


On second look, I'm seeing some strange results on Metal with partial offloading. I'm not sure if this might be fixed in a more recent commit of llama.cpp, but it spits out strange characters:

% ./ollama run llama3
>>> /set parameter num_gpu 10
Set parameter 'num_gpu' to '10'
>>> hi
.8;!177'1G.DC3"6C,64;B!H/G-<392^C

We might need to only enable this on fully loaded models.
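A rough sketch of that gating in llm/server.go, assuming the start-up code already knows how many layers were offloaded; offloadedLayers and totalLayers are placeholder names, not the actual variables in this PR.

// Only pass the flash attention flag when every layer is offloaded to the GPU;
// partial offload on Metal produced the garbled output shown above.
if opts.FlashAttn && offloadedLayers >= totalLayers {
    params = append(params, "--flash-attn")
}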

@jmorganca commented May 16, 2024

(Also: thanks so much for rebasing this quite a few times)

@sammcj commented May 16, 2024

@jmorganca I haven't noticed this!

But to be safe, I just pushed an update to ensure it's disabled if we're using the CPU runner.
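Roughly, that kind of guard could look like the following; the cpuRunner name here is illustrative and not necessarily how the PR implements it.

if cpuRunner != "" {
    // Falling back to the CPU runner: force flash attention off, since the
    // llama.cpp flash attention path needs CUDA (CC >= 7.0) or Metal.
    opts.FlashAttn = false
}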

@sammcj requested a review from jmorganca on May 16, 2024
(A review comment on llm/server.go was marked resolved.)
@sammcj requested a review from jmorganca on May 17, 2024
@dpublic commented May 17, 2024

I know this is late for this pull request, but can you add a new env var so people can force the use of flash_attn?
This would be a way for people to try it on CPU, as the problems noted could be model-related.
This would be similar to how OLLAMA_NUM_PARALLEL sets a parameter.
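For illustration, such an override could be wired up the same way other OLLAMA_* settings are read; the OLLAMA_FLASH_ATTENTION name and this placement are hypothetical, not something the PR adds.

// Hypothetical override: force flash attention on when the environment
// variable is set to a truthy value, even on otherwise unsupported hardware.
if force, err := strconv.ParseBool(os.Getenv("OLLAMA_FLASH_ATTENTION")); err == nil && force {
    opts.FlashAttn = true
}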

@sammcj commented May 17, 2024

> I know this is late for this pull request, but can you add a new env var so people can force the use of flash_attn? This would be a way for people to try it on CPU, as the problems noted could be model-related. This would be similar to how OLLAMA_NUM_PARALLEL sets a parameter.

@dpublic To avoid dragging this PR on, I think I'd rather just leave it as is (auto-enabled).

I did originally have this; however, the preference was to simply enable it when supported, at least until Ollama implements a proper config file / dotenv combo (which would be great 🤞), as there are a lot of settings for both Ollama and llama.cpp that would benefit from centralised configuration.

@sammcj commented May 20, 2024

ping @jmorganca


Successfully merging this pull request may close these issues.

Enable Flash Attention on GGML/GGUF (feature now merged into llama.cpp)