
GPU offloading with little CPU RAM #3940

Open

dcfidalgo opened this issue Apr 26, 2024 · 2 comments · May be fixed by #4123
Labels: feature request (New feature or request)

@dcfidalgo

What is the issue?

Thanks for this amazing project, I really enjoy the simple, concise and easy-to-start interface! Keep up the fantastic work!

I have the following issue: I have a compute instance in the cloud with one NVIDIA A100 80GB and 16GB of CPU memory running Ubuntu.

When I try to run the llama3:70b model, it takes the ollama server a long time to load the model onto the GPU, and as a result I get an "Error: timed out waiting for llama runner to start" from the ollama run llama3:70b command after 10 minutes (I could not figure out how to increase this timeout).

I noticed that ollama first tries to load the whole model into the page cache; however, in my case it does not fit entirely. Only after the entire model has been read once does offloading to the GPU occur. My guess is that, since the initial pages got overwritten, it has to read the entire model again from disk.

I was wondering if there is a way to start the offloading right from the beginning. Not sure if this is even possible, but I think in my case it would help.
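For illustration only (this is not Ollama's actual code): a minimal Go sketch of the map-and-touch pattern the behaviour above suggests. If the model file is memory-mapped and every page is touched once before the GPU copy starts, then with 16 GB of RAM and a ~37 GiB file the kernel has already evicted the earliest pages by the time they are needed again, so the file is effectively read from disk twice. The file path below is a placeholder.

package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    // Placeholder path; stands in for the GGUF blob being loaded.
    f, err := os.Open("/path/to/model.gguf")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    info, err := f.Stat()
    if err != nil {
        panic(err)
    }
    size := int(info.Size())

    // Map the whole file read-only; pages are faulted in lazily as they are touched.
    data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(data)

    // Sequentially touching every 4 KiB page fills the page cache front to back.
    // Once resident pages exceed available RAM, the kernel evicts the oldest ones,
    // so a second pass over the early pages has to hit the disk again.
    var sum byte
    for i := 0; i < size; i += 4096 {
        sum += data[i]
    }
    fmt.Println("touched all pages, checksum:", sum)
}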

This is the log of the server:

...
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: ssm_d_state      = 0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model type       = 70B
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model ftype      = Q4_0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model params     = 70.55 B
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model size       = 37.22 GiB (4.53 BPW)
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: general.name     = Meta-Llama-3-70B-Instruct
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: found 1 CUDA devices:
Apr 26 10:29:40 qa-mpcdf ollama[7668]:   Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_tensors: ggml ctx size =    0.55 MiB
Apr 26 10:32:26 qa-mpcdf ollama[7668]: time=2024-04-26T10:32:26.839Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:32:27 qa-mpcdf ollama[7668]: time=2024-04-26T10:32:27.049Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:35:11 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:11.913Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:35:12 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:12.122Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:35:52 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:52.419Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:35:52 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:52.620Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloading 80 repeating layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloaded 81/81 layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors:        CPU buffer size =   563.62 MiB
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors:      CUDA0 buffer size = 37546.98 MiB
Apr 26 10:36:18 qa-mpcdf ollama[7668]: .....time=2024-04-26T10:36:18.482Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/he>
Apr 26 10:36:18 qa-mpcdf ollama[7668]: time=2024-04-26T10:36:18.683Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:36:51 qa-mpcdf ollama[7668]: .........time=2024-04-26T10:36:51.360Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:3565>
Apr 26 10:36:51 qa-mpcdf ollama[7668]: time=2024-04-26T10:36:51.561Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:38:43 qa-mpcdf ollama[7668]: ............................time=2024-04-26T10:38:43.051Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"ht>
Apr 26 10:38:43 qa-mpcdf ollama[7668]: time=2024-04-26T10:38:43.251Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:07 qa-mpcdf ollama[7668]: .......time=2024-04-26T10:39:07.311Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/>
Apr 26 10:39:07 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:07.513Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:24 qa-mpcdf ollama[7668]: ....time=2024-04-26T10:39:24.763Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/hea>
Apr 26 10:39:24 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:24.964Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:39 qa-mpcdf ollama[7668]: ....time=2024-04-26T10:39:39.396Z level=ERROR source=routes.go:120 msg="error loading llama server" error="timed out waiting for llama runner to start>
Apr 26 10:39:39 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:39.396Z level=DEBUG source=server.go:832 msg="stopping llama server"
Apr 26 10:39:39 qa-mpcdf ollama[7668]: [GIN] 2024/04/26 - 10:39:39 | 500 |         10m1s |       127.0.0.1 | POST     "/api/chat"

Thanks again and have a great day!

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

dcfidalgo added the bug (Something isn't working) label Apr 26, 2024
@dcfidalgo (Author)

On Discord, lewismac pointed out that one could modify the expiresAt variable:

Looks like expiresAt may be too short in your case, in the server.go function WaitUntilRunning:

func (s *LlamaServer) WaitUntilRunning() error {
    start := time.Now()
    // TODO we need to wire up a better way to detect hangs during model load and startup of the server
    expiresAt := time.Now().Add(10 * time.Minute) // be generous with timeout, large models can take a while to load
    ticker := time.NewTicker(50 * time.Millisecond)
    defer ticker.Stop()

I guess you could try building your own with a larger timeout?

Would you be interested in a PR making the expiresAt variable configurable via an environment variable, similar to debug for example? I would be more than happy to provide one.
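A minimal sketch of the idea, assuming the timeout is passed as a Go duration string; the env var name OLLAMA_LOAD_TIMEOUT and the loadTimeout helper are placeholders here, not the actual patch:

package main

import (
    "fmt"
    "os"
    "time"
)

// loadTimeout returns how long to wait for the llama runner to start.
// It reads OLLAMA_LOAD_TIMEOUT (placeholder name) as a Go duration string,
// e.g. "30m", and falls back to the current 10-minute default.
func loadTimeout() time.Duration {
    const fallback = 10 * time.Minute
    v := os.Getenv("OLLAMA_LOAD_TIMEOUT")
    if v == "" {
        return fallback
    }
    d, err := time.ParseDuration(v)
    if err != nil || d <= 0 {
        return fallback
    }
    return d
}

func main() {
    // WaitUntilRunning would then compute:
    //   expiresAt := time.Now().Add(loadTimeout())
    // instead of hard-coding 10 * time.Minute.
    fmt.Println("load timeout:", loadTimeout())
}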

@dhiltgen (Collaborator)

dhiltgen commented May 1, 2024

I think we'd be open to allowing users to override this via env var. Something like OLLAMA_LOAD_TIMEOUT perhaps? Go for it!

dhiltgen self-assigned this May 1, 2024
dhiltgen added the feature request (New feature or request) label and removed the bug (Something isn't working) label May 1, 2024
dcfidalgo linked a pull request May 3, 2024 that will close this issue