Thanks for this amazing project, I really enjoy the simple, concise and easy-to-start interface! Keep up the fantastic work!
I have the following issue: I have a compute instance in the cloud with one NVIDIA A100 80GB and 16GB of CPU memory running Ubuntu.
When I try to run the llama3:70b model, it takes the ollama server a long time to load the model onto the GPU, and as a result I get "Error: timed out waiting for llama runner to start" from the ollama run llama3:70b command after 10 minutes (I could not figure out how to increase this timeout).
I noticed that ollama first tries to load the whole model into the page cache; however, in my case it does not fit entirely. Only after the entire model has been read once does offloading to the GPU occur. My guess is that, since the initial pages got overwritten, it has to read the entire model again from disk.
I was wondering if there is a way to start the offloading right from the beginning. Not sure if this is even possible, but I think in my case it would help.
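One way to test this guess would be to watch the page cache fill (and churn) while the model file is being read. A minimal, Linux-only sketch that reads the Cached figure from /proc/meminfo — the cachedKB helper is hypothetical and not part of ollama:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cachedKB reports the current page-cache size in kB by parsing the
// "Cached:" line of /proc/meminfo (Linux only). Sampling it repeatedly
// while the model loads shows whether the cache saturates and thrashes.
func cachedKB() (int64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "Cached:" {
			var kb int64
			fmt.Sscan(fields[1], &kb)
			return kb, nil
		}
	}
	return 0, fmt.Errorf("Cached: not found in /proc/meminfo")
}

func main() {
	kb, err := cachedKB()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("page cache: %d kB\n", kb)
}
```

On a 16 GB machine loading a 37 GB model file, the Cached value would be expected to plateau near available RAM while the disk keeps reading, consistent with the double-read behavior described above.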
This is the log of the server:
...
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: ssm_d_state = 0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: ssm_dt_rank = 0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model type = 70B
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model ftype = Q4_0
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model params = 70.55 B
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: model size = 37.22 GiB (4.53 BPW)
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: general.name = Meta-Llama-3-70B-Instruct
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_print_meta: LF token = 128 'Ä'
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 26 10:29:40 qa-mpcdf ollama[7668]: ggml_cuda_init: found 1 CUDA devices:
Apr 26 10:29:40 qa-mpcdf ollama[7668]: Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Apr 26 10:29:40 qa-mpcdf ollama[7668]: llm_load_tensors: ggml ctx size = 0.55 MiB
Apr 26 10:32:26 qa-mpcdf ollama[7668]: time=2024-04-26T10:32:26.839Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:32:27 qa-mpcdf ollama[7668]: time=2024-04-26T10:32:27.049Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:35:11 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:11.913Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:35:12 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:12.122Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:35:52 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:52.419Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/health\>
Apr 26 10:35:52 qa-mpcdf ollama[7668]: time=2024-04-26T10:35:52.620Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloading 80 repeating layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: offloaded 81/81 layers to GPU
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: CPU buffer size = 563.62 MiB
Apr 26 10:36:07 qa-mpcdf ollama[7668]: llm_load_tensors: CUDA0 buffer size = 37546.98 MiB
Apr 26 10:36:18 qa-mpcdf ollama[7668]: .....time=2024-04-26T10:36:18.482Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/he>
Apr 26 10:36:18 qa-mpcdf ollama[7668]: time=2024-04-26T10:36:18.683Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:36:51 qa-mpcdf ollama[7668]: .........time=2024-04-26T10:36:51.360Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:3565>
Apr 26 10:36:51 qa-mpcdf ollama[7668]: time=2024-04-26T10:36:51.561Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:38:43 qa-mpcdf ollama[7668]: ............................time=2024-04-26T10:38:43.051Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"ht>
Apr 26 10:38:43 qa-mpcdf ollama[7668]: time=2024-04-26T10:38:43.251Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:07 qa-mpcdf ollama[7668]: .......time=2024-04-26T10:39:07.311Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/>
Apr 26 10:39:07 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:07.513Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:24 qa-mpcdf ollama[7668]: ....time=2024-04-26T10:39:24.763Z level=DEBUG source=server.go:420 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:35651/hea>
Apr 26 10:39:24 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:24.964Z level=DEBUG source=server.go:420 msg="server not yet available" error="server not responding"
Apr 26 10:39:39 qa-mpcdf ollama[7668]: ....time=2024-04-26T10:39:39.396Z level=ERROR source=routes.go:120 msg="error loading llama server" error="timed out waiting for llama runner to start>
Apr 26 10:39:39 qa-mpcdf ollama[7668]: time=2024-04-26T10:39:39.396Z level=DEBUG source=server.go:832 msg="stopping llama server"
Apr 26 10:39:39 qa-mpcdf ollama[7668]: [GIN] 2024/04/26 - 10:39:39 | 500 | 10m1s | 127.0.0.1 | POST "/api/chat"
Thanks again and have a great day!
OS
Linux
GPU
Nvidia
CPU
Intel
Ollama version
0.1.32
On Discord, lewismac pointed out that one could modify the expiresAt variable.
Looks like expiresAt may be too short in your case; it is set in the WaitUntilRunning function in server.go:
func (s *LlamaServer) WaitUntilRunning() error {
	start := time.Now()
	// TODO we need to wire up a better way to detect hangs during model load and startup of the server
	expiresAt := time.Now().Add(10 * time.Minute) // be generous with timeout, large models can take a while to load
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
I guess you could try building your own with a larger timeout?
Would you be interested in a PR making the expiresAt timeout configurable via an environment variable, like debug for example? I would be more than happy to provide one.