
GPU offloading with little CPU RAM #3940

Open

dcfidalgo opened this issue Apr 26, 2024 · 2 comments · May be fixed by #4123
Labels: feature request (New feature or request)

@dcfidalgo

What is the issue?

Thanks for this amazing project, I really enjoy the simple, concise and easy-to-start interface! Keep up the fantastic work!

I have the following issue: I have a compute instance in the cloud with one NVIDIA A100 80GB and 16GB of CPU memory running Ubuntu.

When I try to run the llama3:70b model, it takes the ollama server a long time to load the model onto the GPU, and as a result I get an "Error: timed out waiting for llama runner to start" from the ollama run llama3:70b command after 10 minutes (I could not figure out how to increase this timeout).

I noticed that ollama first tries to load the whole model into the page cache; however, in my case it does not fit entirely. Only after the entire model has been read once does offloading to the GPU occur. My guess is that, since the initial pages got overwritten, it has to read the entire model again from disk.

I was wondering if there is a way to start the offloading right from the beginning. Not sure if this is even possible, but I think in my case it would help.
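For illustration only (this is not Ollama's actual code): a minimal Go sketch of the map-and-touch pattern the behaviour above suggests. If the model file is memory-mapped and every page is touched once before the GPU copy starts, then with 16 GB of RAM and a ~37 GiB file the kernel has already evicted the earliest pages by the time they are needed again, so the file is effectively read from disk twice. The file path below is a placeholder.

package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    // Placeholder path; stands in for the GGUF blob being loaded.
    f, err := os.Open("/path/to/model.gguf")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        panic(err)
    }
    size := int(info.Size())

    // Map the whole file read-only; pages are faulted in lazily as they are touched.
    data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(data)

    // Sequentially touching every 4 KiB page fills the page cache front to back.
    // Once resident pages exceed available RAM, the kernel evicts the oldest ones,
    // so a second pass over the early pages has to hit the disk again.
    var sum byte
    for i := 0; i < size; i += 4096 {
        sum += data[i]
    }
    fmt.Println("touched all pages, checksum:", sum)
}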

This is the log of the server:

...
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: ssm_d_state      = 0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model type       = 70B
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model ftype      = Q4_0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model params     = 70.55 B
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW)
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: general.name     = Meta-Llama-3-70B-Instruct
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: found 1 CUDA devices:
Apr 26 10:29:40 qa-mpcdf ollama[7668]:   Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_tensors: ggml ctx size =    0.55 MiB
Apr 26 10:32:26 qa-mpcdf ollama[7668]: time=2024-04-26T10:32:26.839Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:32:27 qa-mpcdf ollama[7668]: time=2024-04-26T10:32:27.049Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:35:11 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:11.913Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:35:12 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:12.122Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:35:52 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:52.419Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:35:52 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:52.620Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloading 80 repeating layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloaded 81/81 layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors:        CPU buffer size =   563.62 MiB
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors:      CUDA0 buffer size = 37546.98 MiB
Apr 26 10:36:18 qa-mpcdf ollama[7668]: .....time=2024-04-26T10:36:18.482Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/he>
Apr 26 10:36:18 qa-mpcdf ollama[7668]: time=2024-04-26T10:36:18.683Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:36:51 qa-mpcdf ollama[7668]: .........time=2024-04-26T10:36:51.360Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:3565>
Apr 26 10:36:51 qa-mpcdf ollama[7668]: time=2024-04-26T10:36:51.561Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:38:43 qa-mpcdf ollama[7668]: ............................time=2024-04-26T10:38:43.051Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"ht>
Apr 26 10:38:43 qa-mpcdf ollama[7668]: time=2024-04-26T10:38:43.251Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:07 qa-mpcdf ollama[7668]: .......time=2024-04-26T10:39:07.311Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/>
Apr 26 10:39:07 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:07.513Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:24 qa-mpcdf ollama[7668]: ....time=2024-04-26T10:39:24.763Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/hea>
Apr 26 10:39:24 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:24.964Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:39 qa-mpcdf ollama[7668]: ....time=2024-04-26T10:39:39.396Z level=ERROR source=routes.go:120 msg="error loading llama server" error="timed out waiting for llama runner to start>
Apr 26 10:39:39 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:39.396Z level=DEBUG source=server.go:832 msg="stopping llama server"
Apr 26 10:39:39 qa-mpcdf ollama[7668]: [GIN] 2024/04/26 - 10:39:39 | 500 |         10m1s |       127.0.0.1 | POST     "/api/chat"

Thanks again and have a great day!

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

dcfidalgo added the bug (Something isn't working) label Apr 26, 2024
@dcfidalgo (Author)

On Discord, lewismac pointed out that one could modify the expiresAt variable:

Looks like expiresAt may be too short in your case, in the server.go function WaitUntilRunning:

func (s *LlamaServer) WaitUntilRunning() error {
    start := time.Now()
    // TODO we need to wire up a better way to detect hangs during model load and startup of the server
    expiresAt := time.Now().Add(10 * time.Minute) // be generous with timeout, large models can take a while to load
    ticker := time.NewTicker(50 * time.Millisecond)
    defer ticker.Stop()

I guess you could try building your own with a larger timeout?

Would you be interested in a PR making the expiresAt variable configurable via an environment variable, similar to debug for example? I would be more than happy to provide one.
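A minimal sketch of the idea, assuming the timeout is passed as a Go duration string; the env var name OLLAMA_LOAD_TIMEOUT and the loadTimeout helper are placeholders here, not the actual patch:

package main

import (
    "fmt"
    "os"
    "time"
)

// loadTimeout returns how long to wait for the llama runner to start.
// It reads OLLAMA_LOAD_TIMEOUT (placeholder name) as a Go duration string,
// e.g. "30m", and falls back to the current 10-minute default.
func loadTimeout() time.Duration {
    const fallback = 10 * time.Minute
    v := os.Getenv("OLLAMA_LOAD_TIMEOUT")
    if v == "" {
        return fallback
    }
    d, err := time.ParseDuration(v)
    if err != nil || d <= 0 {
        return fallback
    }
    return d
}

func main() {
    // WaitUntilRunning would then compute:
    //   expiresAt := time.Now().Add(loadTimeout())
    // instead of hard-coding 10 * time.Minute.
    fmt.Println("load timeout:", loadTimeout())
}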

@dhiltgen (Collaborator)

dhiltgen commented May 1, 2024

I think we'd be open to allowing users to override this via env var. Something like OLLAMA_LOAD_TIMEOUT perhaps? Go for it!

dhiltgen self-assigned this May 1, 2024
dhiltgen added the feature request (New feature or request) label and removed the bug (Something isn't working) label May 1, 2024
dcfidalgo linked a pull request May 3, 2024 that will close this issue