
Enable Flash Attention on GGML/GGUF (feature now merged into llama.cpp) #4051

Open
sammcj opened this issue Apr 30, 2024 · 20 comments · May be fixed by #4120
Labels
feature request New feature or request

Comments

@sammcj

sammcj commented Apr 30, 2024

Flash Attention has landed in llama.cpp (ggerganov/llama.cpp#5021).

The tl;dr is simply to pass the -fa flag to llama.cpp's server.

  • Can we please have an Ollama server env var to pass this flag to the underlying llama.cpp server?

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that flags like this can be easily enabled (e.g. OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else)?
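
As a rough illustration of the request, here is a minimal Go sketch of what such a hook might look like; the package, function name, and comma-splitting are assumptions for illustration, not Ollama's actual implementation:

package llm // illustrative placement only

import (
	"os"
	"strings"
)

// appendExtraArgs sketches the proposed hook: read a comma-separated list of
// llama.cpp flags from OLLAMA_LLAMA_EXTRA_ARGS and append them to the argument
// list built for the underlying llama.cpp server.
func appendExtraArgs(params []string) []string {
	extra := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS")
	if extra == "" {
		return params
	}
	for _, arg := range strings.Split(extra, ",") {
		if arg = strings.TrimSpace(arg); arg != "" {
			params = append(params, arg)
		}
	}
	return params
}

With a hook like that in place, OLLAMA_LLAMA_EXTRA_ARGS="-fa" ollama serve would be enough to turn Flash Attention on for the underlying server.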

@sammcj sammcj added the feature request New feature or request label Apr 30, 2024
@DuckyBlender

Are there any cons to using Flash Attention? If not, it should probably be the default.

@DuckyBlender

Also, if there is more memory available, we should really set the default quant to Q4_K_M and fall back to Q4 when Q4_K_M is not available.
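
A purely hypothetical Go sketch of that fallback logic (every name here is invented for illustration; the tag strings just mirror the quants mentioned above):

// pickQuant sketches the suggested default-quant fallback: prefer Q4_K_M and
// fall back to Q4 when it is not available. "available" stands in for whatever
// set of tags the registry actually exposes; it is not an Ollama data structure.
func pickQuant(available map[string]bool) string {
	for _, quant := range []string{"Q4_K_M", "Q4"} {
		if available[quant] {
			return quant
		}
	}
	return "" // neither tag published; caller keeps the current default
}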

@jukofyork

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that flags like this can be easily enabled (e.g. OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else)?

Yeah, I think this would help a lot.

@jukofyork

jukofyork commented May 2, 2024

Flash Attention has landed in llama.cpp (ggerganov/llama.cpp#5021).

The tl;dr is simply to pass the -fa flag to llama.cpp's server.

* Can we please have an Ollama server env var to pass this flag to the underlying llama.cpp server?

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that flags like this can be easily enabled (e.g. OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else)?

Just been looking at the source to see if this can be done:

https://github.com/ollama/ollama/blob/main/llm/ext_server/server.cpp

and even though it imports the latest gpt_params structure (from the llama.cpp submodule, dated 2 days ago), all the parsing is done by a server_params_parse function stripped from an outdated version of the llama.cpp server.

Some of the default values (e.g. --repeat-penalty) and even flag meanings (e.g. --batch) are also behind because of this.

I guess we could try to make a diff to see if this can be updated, but since PRs never seem to get accepted and random stuff constantly seems to change regarding the passing of command line options, I'm reluctant to even try...

@sammcj

sammcj commented May 2, 2024

I had a hot crack at a PR tonight but, as you said @jukofyork, it seems server.cpp has diverged from llama.cpp a lot and I couldn't get my head around it.

Might have to wait for someone more familiar with Ollama's version of it / more brains than me.

@sammcj

sammcj commented May 3, 2024

FYI: LM Studio added Flash Attention earlier today: https://www.reddit.com/r/LocalLLaMA/comments/1cir98j/lm_studio_released_new_version_with_flash/

@sammcj

sammcj commented May 3, 2024

Comparing GGUF performance with/without Flash Attention

Hardware

  • Apple MacBook Pro M2 Max (96GB)

Model

  • llama3-bartowski-8b-instruct-q8-0.gguf

Request

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "registry.internal/ollama/llama3-bartowski:8b-instruct-q8_0",
        "messages": [
            {
                "role": "user",
                "content": "Tell me two short jokes."
            }
        ]
    }'

LM Studio

With Flash Attention

Start time: 14:10:20.401
End time: 14:10:21.636
"prompt_tokens": 38,
"completion_tokens": 37,
"total_tokens": 75
  • ~1.235 seconds
  • ~60.65 tokens/s

Ollama

Without Flash Attention

total duration:       1.866343375s
load duration:        1.299853709s
prompt eval count:    15 token(s)
prompt eval duration: 84.22ms
prompt eval rate:     178.10 tokens/s
eval count:           18 token(s)
eval duration:        481.335ms
eval rate:            37.40 tokens/s

Start TIMESTAMP: 1714709035
End TIMESTAMP: 1714709036
  • 1.866 seconds
  • 40.38 tokens/s

@wanderingmeow

Got it working by adding the OLLAMA_LLAMA_EXTRA_ARGS environment variable in llm/server.go as @sammcj suggested. This allows the -fa flag to be passed into ext_server/server.cpp. If anyone is interested, my hack can be found here. Hopefully this helps others get Flash Attention working too.

@sammcj

sammcj commented May 3, 2024

That looks VERY similar to what I tried, @wanderingmeow; perhaps I just wasn't parsing the parameters correctly 🤔. That's awesome!

@sammcj

sammcj commented May 3, 2024

That worked instantly. Still nowhere near as fast as LM Studio, but it's a start.

export OLLAMA_LLAMA_EXTRA_ARGS="-fa"

ollama serve

...
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
...
ollama run registry.internal/ollama/llama3-bartowski:8b-instruct-q8_0 'tell me two short jokes' --verbose
Here are two short jokes:

1. Why don't scientists trust atoms? Because they make up everything!
2. Why don't eggs tell jokes? They'd crack each other up!

Hope you find them amusing!

total duration:       2.539137625s
load duration:        1.286854542s
prompt eval count:    15 token(s)
prompt eval duration: 83.363ms
prompt eval rate:     179.94 tokens/s
eval count:           44 token(s)
eval duration:        1.168093s
eval rate:            37.67 tokens/s

@sammcj sammcj linked a pull request May 3, 2024 that will close this issue
@sammcj

sammcj commented May 3, 2024

@wanderingmeow I've created a PR with the changes - #4120

@jukofyork

Awesome! 👍

@jukofyork

jukofyork commented May 3, 2024

I wonder if we can extend this:

	if other_args := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"); other_args != "" {
		params = append(params, strings.Split(other_args, ",")...)
	}

to use a regex to match against the model name passed in the model string?

Then we can just completely bypass the modelfile code and pass model-specific parameters as well? This would save all the hassle of PRs not getting accepted and then breaking because the code all got moved around!

It looks like we could just copy in server_params_parse from the sub-module (possibly automatically using AWK or something) and we would have access to any parameter added to llama.cpp.

It probably doesn't want to be split using "," though, as that is used by llama.cpp for some options like --tensor-split, etc.

@jukofyork

jukofyork commented May 3, 2024

Something like:

export OLLAMA_LLAMA_EXTRA_ARGS=".* ; -fa | qwen.* ; -sm row ; -ts 3,2 | llama2.* ; --rope-freq-base 8192"

Although I'm not sure if the model string will actually be the name or the weird sha hash name that points to the GGUF file?
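
For what it's worth, here is a minimal Go sketch of how that proposed syntax could be parsed ("|" between entries, ";" between a model-name regex and its flag groups); the function name and placement are hypothetical, and none of this exists in Ollama or PR #4120:

package llm // illustrative placement only

import (
	"os"
	"regexp"
	"strings"
)

// extraArgsForModel returns the extra llama.cpp flags from every entry whose
// regex matches the given model name, e.g. ".* ; -fa | qwen.* ; -sm row ; -ts 3,2".
func extraArgsForModel(model string) []string {
	var args []string
	for _, entry := range strings.Split(os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"), "|") {
		fields := strings.Split(entry, ";")
		if len(fields) < 2 {
			continue
		}
		re, err := regexp.Compile(strings.TrimSpace(fields[0]))
		if err != nil || !re.MatchString(model) {
			continue
		}
		for _, group := range fields[1:] {
			args = append(args, strings.Fields(group)...)
		}
	}
	return args
}

Splitting entries on "|" and flag groups on ";" also sidesteps the "," problem mentioned earlier, since values like -ts 3,2 keep their commas intact.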

@wanderingmeow

Still nowhere near as fast as LM Studio, but it's a start.

From my testing, the current ollama ext_server implementation performs on par with the latest server example from llama.cpp, with only a slight slowdown (<1%).

@sammcj

sammcj commented May 3, 2024

It looks like we could just copy in server_params_parse from the sub-module (possibly automatically using AWK or something) and we would have access to any parameter added to llama.cpp.

@jukofyork I had a look at this, but it was getting messy, fast.

What I think probably could work is:

  1. Split out Ollama's custom server configuration from the model server parameters.
  2. Do the same in llama.cpp in PR (if @ggerganov thinks this might be a good idea).
  3. Then Ollama or any project that wants to use llama.cpp's model server parameters library can do so separate from their server configuration logic.

That's way out of my ball park though 😅

@jukofyork

Still nowhere near as fast as LM Studio, but it's a start.

From my testing, the current ollama ext_server implementation performs on par with the latest server example from llama.cpp, with only a slight slowdown (<1%).

It really helps reduce the VRAM use of long-context models: some I could only run at 16k or 32k are now running with 32k or 64k for the same quant!

@jukofyork

It looks like we could just copy in server_params_parse from the sub-module (possibly automatically using AWK or something) and we would have access to any parameter added to llama.cpp.

@jukofyork I had a look at this, but it was getting messy, fast.

What I think probably could work is:

1. Split out Ollama's custom server configuration from the model server parameters.

2. Do the same in llama.cpp in PR (if @ggerganov thinks this might be a good idea).

3. Then Ollama or any project that wants to use llama.cpp's model server parameters library can do so separate from their server configuration logic.

That's way out of my ball park though 😅

Yeah, but it would be good in the long run: some of the parameters and their default settings that Ollama is using are very out of date compared to llama.cpp's server now, and I can only see things getting worse if nothing is done about it.

@sammcj

sammcj commented May 4, 2024

Anyway, #4120 provides the functionality for now; just waiting on someone to approve it...

@sammcj

sammcj commented May 11, 2024

#4120 is still sitting there waiting for approval to be merged. I've been trying to keep it up to date and fix conflicts, etc., as other PRs are merged in.
