
Enable Flash Attention on GGML/GGUF (feature now merged into llama.cpp) #4051

Open
sammcj opened this issue Apr 30, 2024 · 20 comments · May be fixed by #4120
Labels
feature request New feature or request

Comments

@sammcj

sammcj commented Apr 30, 2024

Flash Attention has landed in llama.cpp (ggerganov/llama.cpp#5021).

The tl;dr is simply to pass the -fa flag to llama.cpp's server.

  • Can we please have an Ollama server env var to pass this flag to the underlying llama.cpp server?

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that flags like this can be easily enabled (e.g. OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else)?
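
As a rough illustration of the request, here is a minimal Go sketch of what such a hook might look like; the package, function name, and comma-splitting are assumptions for illustration, not Ollama's actual implementation:

package llm // illustrative placement only

import (
	"os"
	"strings"
)

// appendExtraArgs sketches the proposed hook: read a comma-separated list of
// llama.cpp flags from OLLAMA_LLAMA_EXTRA_ARGS and append them to the argument
// list built for the underlying llama.cpp server.
func appendExtraArgs(params []string) []string {
	extra := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS")
	if extra == "" {
		return params
	}
	for _, arg := range strings.Split(extra, ",") {
		if arg = strings.TrimSpace(arg); arg != "" {
			params = append(params, arg)
		}
	}
	return params
}

With a hook like that in place, OLLAMA_LLAMA_EXTRA_ARGS="-fa" ollama serve would be enough to turn Flash Attention on for the underlying server.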

@sammcj sammcj added the feature request New feature or request label Apr 30, 2024
@DuckyBlender

Are there any cons to using Flash Attention? If not, it should probably be the default.

@DuckyBlender

Also, if there is more memory available, we should really set the default quant to Q4_K_M and fall back to Q4 when Q4_K_M is not available.
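
A purely hypothetical Go sketch of that fallback logic (every name here is invented for illustration; the tag strings just mirror the quants mentioned above):

// pickQuant sketches the suggested default-quant fallback: prefer Q4_K_M and
// fall back to Q4 when it is not available. "available" stands in for whatever
// set of tags the registry actually exposes; it is not an Ollama data structure.
func pickQuant(available map[string]bool) string {
	for _, quant := range []string{"Q4_K_M", "Q4"} {
		if available[quant] {
			return quant
		}
	}
	return "" // neither tag published; caller keeps the current default
}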

@jukofyork

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that flags like this can be easily enabled (e.g. OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else)?

Yeah, I think this would help a lot.

@jukofyork

jukofyork commented May 2, 2024

Flash Attention has landed in llama.cpp (ggerganov/llama.cpp#5021).

The tl;dr is simply to pass the -fa flag to llama.cpp's server.

* Can we please have an Ollama server env var to pass this flag to the underlying llama.cpp server?

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that flags like this can be easily enabled (e.g. OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else)?

Just been looking at the source to see if this can be done:

https://github.com/ollama/ollama/blob/main/llm/ext_server/server.cpp

and even though it imports the latest gpt_params structure (from the llama.cpp submodule, dated 2 days ago), all the parsing is done by a server_params_parse function stripped from an outdated version of the llama.cpp server.

Some of the default values (e.g. --repeat-penalty) and even flag meanings (e.g. --batch) are also behind because of this.

I guess we could try to make a diff to see if this can be updated, but since PRs never seem to get accepted and random stuff constantly seems to change regarding the passing of command line options, I'm reluctant to even try...

@sammcj

sammcj commented May 2, 2024

I had a hot crack at a PR tonight but, as you said @jukofyork, it seems server.cpp has diverged from llama.cpp a lot and I couldn't get my head around it.

Might have to wait for someone more familiar with Ollama's version of it / more brains than me.

@sammcj

sammcj commented May 3, 2024

FYI: LM Studio added Flash Attention earlier today: https://www.reddit.com/r/LocalLLaMA/comments/1cir98j/lm_studio_released_new_version_with_flash/

@sammcj

sammcj commented May 3, 2024

Comparing GGUF performance with/without Flash Attention

Hardware

  • Apple MacBook Pro M2 Max (96GB)

Model

  • llama3-bartowski-8b-instruct-q8-0.gguf

Request

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "registry.internal/ollama/llama3-bartowski:8b-instruct-q8_0",
        "messages": [
            {
                "role": "user",
                "content": "Tell me two short jokes."
            }
        ]
    }'

LM Studio

With Flash Attention

Start time: 14:10:20.401
End time: 14:10:21.636
"prompt_tokens": 38,
"completion_tokens": 37,
"total_tokens": 75
  • ~1.235 seconds
  • ~60.65 tokens/s

Ollama

Without Flash Attention

total duration:       1.866343375s
load duration:        1.299853709s
prompt eval count:    15 token(s)
prompt eval duration: 84.22ms
prompt eval rate:     178.10 tokens/s
eval count:           18 token(s)
eval duration:        481.335ms
eval rate:            37.40 tokens/s

Start TIMESTAMP: 1714709035
End TIMESTAMP: 1714709036
  • 1.866 seconds
  • 40.38 tokens/s

@wanderingmeow

Got it working by adding the OLLAMA_LLAMA_EXTRA_ARGS environment variable in llm/server.go as @sammcj suggested. This allows the -fa flag to be passed into ext_server/server.cpp. If anyone is interested, my hack can be found here. Hopefully this helps others get Flash Attention working too.

@sammcj

sammcj commented May 3, 2024

That looks VERY similar to what I tried, @wanderingmeow; perhaps I just wasn't parsing the parameters correctly 🤔. That's awesome!

@sammcj

sammcj commented May 3, 2024

That worked instantly. Still nowhere near as fast as LM Studio, but it's a start.

export OLLAMA_LLAMA_EXTRA_ARGS="-fa"

ollama serve

...
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
...
ollama run registry.internal/ollama/llama3-bartowski:8b-instruct-q8_0 'tell me two short jokes' --verbose
Here are two short jokes:

1. Why don't scientists trust atoms? Because they make up everything!
2. Why don't eggs tell jokes? They'd crack each other up!

Hope you find them amusing!

total duration:       2.539137625s
load duration:        1.286854542s
prompt eval count:    15 token(s)
prompt eval duration: 83.363ms
prompt eval rate:     179.94 tokens/s
eval count:           44 token(s)
eval duration:        1.168093s
eval rate:            37.67 tokens/s

@sammcj sammcj linked a pull request May 3, 2024 that will close this issue
@sammcj

sammcj commented May 3, 2024

@wanderingmeow I've created a PR with the changes - #4120

@jukofyork

Awesome! 👍

@jukofyork

jukofyork commented May 3, 2024

I wonder if we can extend this:

	if other_args := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"); other_args != "" {
		params = append(params, strings.Split(other_args, ",")...)
	}

to use a regex to match against the model name passed in the model string?

Then we can just completely bypass the modelfile code and pass model-specific parameters as well? This would save all the hassle of PRs not getting accepted and then breaking because the code all got moved around!

It looks like we could just copy in server_params_parse from the sub-module (possibly automatically using AWK or something) and we would have access to any parameter added to llama.cpp.

It probably doesn't want to be split using "," though, as that is used by llama.cpp for some options like --tensor-split, etc.

@jukofyork

jukofyork commented May 3, 2024

Something like:

export OLLAMA_LLAMA_EXTRA_ARGS=".* ; -fa | qwen.* ; -sm row ; -ts 3,2 | llama2.* ; --rope-freq-base 8192"

Although I'm not sure if the model string will actually be the name or the weird sha hash name that points to the GGUF file?
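
For what it's worth, here is a minimal Go sketch of how that proposed syntax could be parsed ("|" between entries, ";" between a model-name regex and its flag groups); the function name and placement are hypothetical, and none of this exists in Ollama or PR #4120:

package llm // illustrative placement only

import (
	"os"
	"regexp"
	"strings"
)

// extraArgsForModel returns the extra llama.cpp flags from every entry whose
// regex matches the given model name, e.g. ".* ; -fa | qwen.* ; -sm row ; -ts 3,2".
func extraArgsForModel(model string) []string {
	var args []string
	for _, entry := range strings.Split(os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"), "|") {
		fields := strings.Split(entry, ";")
		if len(fields) < 2 {
			continue
		}
		re, err := regexp.Compile(strings.TrimSpace(fields[0]))
		if err != nil || !re.MatchString(model) {
			continue
		}
		for _, group := range fields[1:] {
			args = append(args, strings.Fields(group)...)
		}
	}
	return args
}

Splitting entries on "|" and flag groups on ";" also sidesteps the "," problem mentioned earlier, since values like -ts 3,2 keep their commas intact.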

@wanderingmeow

Still nowhere near as fast as LM Studio, but it's a start.

From my testing, the current ollama ext_server implementation performs on par with the latest server example from llama.cpp, with only a slight slowdown (<1%).

@sammcj

sammcj commented May 3, 2024

It looks like we could just copy in server_params_parse from the sub-module (possibly automatically using AWK or something) and we would have access to any parameter added to llama.cpp.

@jukofyork I had a look at this, but it was getting messy, fast.

What I think probably could work is:

  1. Split out Ollama's custom server configuration from the model server parameters.
  2. Do the same in llama.cpp in PR (if @ggerganov thinks this might be a good idea).
  3. Then Ollama or any project that wants to use llama.cpp's model server parameters library can do so separate from their server configuration logic.

That's way out of my ball park though 😅

@jukofyork

Still nowhere near as fast as LM Studio, but it's a start.

From my testing, the current ollama ext_server implementation performs on par with the latest server example from llama.cpp, with only a slight slowdown (<1%).

It really helps reduce the VRAM use of long-context models: some I could only run at 16k or 32k are now running with 32k or 64k for the same quant!

@jukofyork

It looks like we could just copy in server_params_parse from the sub-module (possibly automatically using AWK or something) and we would have access to any parameter added to llama.cpp.

@jukofyork I had a look at this, but it was getting messy, fast.

What I think probably could work is:

1. Split out Ollama's custom server configuration from the model server parameters.

2. Do the same in llama.cpp in PR (if @ggerganov thinks this might be a good idea).

3. Then Ollama or any project that wants to use llama.cpp's model server parameters library can do so separate from their server configuration logic.

That's way out of my ball park though 😅

Yeah, but it would be good in the long run: some of the parameters and their default settings that Ollama is using are very out of date compared to llama.cpp's server now, and I can only see things getting worse if nothing is done about it.

@sammcj

sammcj commented May 4, 2024

Anyway, #4120 provides the functionality for now; just waiting on someone to approve it...

@sammcj

sammcj commented May 11, 2024

#4120 is still sitting there waiting for approval to be merged. I've been trying to keep it up to date and fix conflicts, etc., as other PRs are merged in.
