
only 1 GPU found -- regression 1.32 -> 1.33 #4139

Closed

AlexLJordan opened this issue May 3, 2024 · 19 comments · May be fixed by #4264
Labels
bug (Something isn't working), nvidia (Issues relating to Nvidia GPUs and CUDA)

Comments

@AlexLJordan

What is the issue?

Hi everyone,

Sorry, I don't have much time to write this up, but going from 1.32 to 1.33, this:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
  Device 2: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.45 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:      CUDA0 buffer size =  1194.53 MiB
llm_load_tensors:      CUDA1 buffer size =  1194.53 MiB
llm_load_tensors:      CUDA2 buffer size =  1188.49 MiB

changed into this:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 3 repeating layers to GPU
llm_load_tensors: offloaded 3/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3647.87 MiB
llm_load_tensors:      CUDA0 buffer size =   325.78 MiB

1.33 hammers my CPU cores, is generally slower, and doesn't even properly utilize the one GPU it does find.

I need the new concurrency features, so I'd really appreciate it if 1.33 worked on my machine.

Please help.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

1.33

@AlexLJordan AlexLJordan added the bug Something isn't working label May 3, 2024
@jmorganca jmorganca added the nvidia Issues relating to Nvidia GPUs and CUDA label May 3, 2024
@dhiltgen dhiltgen self-assigned this May 3, 2024
@dhiltgen
Collaborator

dhiltgen commented May 3, 2024

Can you share more of the server log, ideally with OLLAMA_DEBUG=1 set, so we can see the early bootstrapping GPU discovery logic?
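For reference, a minimal way to enable it (a sketch for a typical Linux install; the systemd unit name may differ on your machine):

# running the server by hand
OLLAMA_DEBUG=1 ollama serve

# or, for a systemd-managed install (assumes the unit is called ollama.service):
sudo systemctl edit ollama.service     # add: [Service] / Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f                # follow the server log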

@AlexLJordan
Author

AlexLJordan commented May 3, 2024

These are logs that I store automatically, so they don't have OLLAMA_DEBUG set. It's late here, so if these logs aren't helpful, I'll rerun with DEBUG tomorrow.

Ollama 1.32
ollama-1.32.log

Ollama 1.33
ollama-1.33.log

Thanks for your help!

@dhiltgen
Collaborator

dhiltgen commented May 3, 2024

From the logs I can see that we did discover all 3 GPUs:

time=2024-05-03T22:22:34.769+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3

Unfortunately, without debug set, I can't see why the scheduler decided to run on only a single GPU with only 3 layers. If you can re-run just 0.1.33 with OLLAMA_DEBUG=1 and share the log, that will help root-cause the defect.

@bsdnet
Contributor

bsdnet commented May 4, 2024

@dhiltgen it seems the log you referred to is from ollama-1.32.log.

@AlexLJordan
Author

AlexLJordan commented May 4, 2024

Hi again!

I was able to rerun the workload with DEBUG enabled on both versions [see below].

Weirdly enough, Ollama 1.33 uses a full GPU this time:
[screenshot: Selection_2024-05-04--001]

It's still much slower than 1.32, though: on 1.32 one set of jobs completes in half an hour, while 1.33 shows a projected completion time somewhere between 2 and 3.5 hours.

<EDIT>
Additional weirdness:
Yesterday's run of 1.33 didn't have that msg="detected GPUs" library=/tmp/ollama2048242415/runners/cuda_v11/libcudart.so.11.0 count=3 line, as bsdnet already pointed out.
But in the attached log, the following line shows up:

time=2024-05-04T14:05:09.283+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2604474549/runners/cuda_v11/libcudart.so.11.0 count=3

</EDIT>

I'm relatively sure that the only change from yesterday to today is adding the following to the environment (.env) file where Ollama runs. (Turns out I had the Ollama env vars in the wrong file, so NUM_PARALLEL and MAX_LOADED_MODELS weren't included in the environment yesterday.)

export OLLAMA_NUM_PARALLEL=16
export OLLAMA_MAX_LOADED_MODELS=3

export OLLAMA_DEBUG=1
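(Side note in case it helps anyone else: to confirm which OLLAMA_* variables the running server actually picked up, something like this works — a rough sketch, assuming Linux, a single ollama serve process, and permission to read its /proc entry:)

tr '\0' '\n' < /proc/$(pgrep -f "ollama serve")/environ | grep OLLAMA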

ollama-1.32-DEBUG.log

ollama-1.33-DEBUG.log

@bsdnet
Contributor

bsdnet commented May 4, 2024

Not sure whether the issue comes from timing :)

Enabling debug usually means more logging; more logging usually means the timing changed.

One way to confirm this is to run 1.33 without DEBUG enabled.

@dhiltgen
Collaborator

dhiltgen commented May 4, 2024

Based on your 0.1.33 log with debug enabled:

It sees all 3 GPUs:

time=2024-05-04T14:05:09.283+02:00 level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama2604474549/runners/cuda_v11/libcudart.so.11.0 count=3
time=2024-05-04T14:05:09.283+02:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] CUDA totalMem 34089730048
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] CUDA freeMem 33765720064
[GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7] Compute Capability 7.0
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] CUDA totalMem 34089730048
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] CUDA freeMem 33765720064
[GPU-5105e575-3fba-efc4-c055-9b7051c99884] Compute Capability 7.0
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] CUDA totalMem 34089730048
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] CUDA freeMem 33765720064
[GPU-d6e916e0-25b7-1a3d-845b-3c4531c447e4] Compute Capability 7.0

The scheduler determined the requested model could fit in a single GPU for best performance:

time=2024-05-04T14:05:10.668+02:00 level=DEBUG source=sched.go:508 msg="new model will fit in available VRAM in single GPU, loading" model=/home/aljordan/.ollama/models/blobs/sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 gpu=GPU-a67a747a-f135-4d23-c0c3-5b2dd68c33d7 available=33765720064 required="5222.6 MiB"

and we can see the backend loaded all the layers:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
time=2024-05-04T14:05:10.922+02:00 level=DEBUG source=server.go:466 msg="server not yet available" error="server not responding"
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU

It is possible we have a scheduling race we haven't found/fixed yet, since the scheduler code is brand new. If you manage to repro the failure mode of hitting a single GPU with partial offload, please share the logs so we can see what the scheduler was doing.

@thevisad

thevisad commented May 4, 2024

I had the same issue today; rolling back to 1.31 resolved it. I spent the day in the Discord chatting with users, trying various things without resolution. I was able to raise num_gpu to the required amount, and Ollama will then find and utilize both GPUs.
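For anyone wanting to try the same workaround, a rough sketch of bumping num_gpu per request (num_gpu is the number of layers to offload, not a GPU count; "llama3" below is just a placeholder model name):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello",
  "options": { "num_gpu": 33 }
}'

The same parameter can be set interactively with /set parameter num_gpu 33 inside ollama run, or baked into a Modelfile with PARAMETER num_gpu 33.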


May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.005Z level=INFO source=images.go:828 msg="total blobs: 30"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.005Z level=INFO source=images.go:835 msg="total unused blobs removed: 0"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.006Z level=INFO source=routes.go:1071 msg="Listening on [::]:11434 (version 0.1.33)"
May  4 15:51:13 prettygirl ollama[31666]: time=2024-05-04T15:51:13.006Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1884500785/runners
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.131Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.132Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.278Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1884500785/runners/cuda_v11/libcudart.so.11.0 count=2
May  4 15:51:16 prettygirl ollama[31666]: time=2024-05-04T15:51:16.278Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.533Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.535Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama1884500785/runners/cuda_v11/libcudart.so.11.0 count=2
May  4 15:52:02 prettygirl ollama[31666]: time=2024-05-04T15:52:02.535Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="48402.9 MiB" memory.required.full="5222.6 MiB" memory.required.partial="5222.6 MiB" memory.required.kv="1024.0 MiB" memory.weights.total="3577.6 MiB" memory.weights.repeating="3475.0 MiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="193.0 MiB"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="48402.9 MiB" memory.required.full="5222.6 MiB" memory.required.partial="5222.6 MiB" memory.required.kv="1024.0 MiB" memory.weights.total="3577.6 MiB" memory.weights.repeating="3475.0 MiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="193.0 MiB"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.076Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama1884500785/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 1 --port 43549"
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=sched.go:340 msg="loaded runners" count=1
May  4 15:52:03 prettygirl ollama[31666]: time=2024-05-04T15:52:03.077Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
May  4 15:52:03 prettygirl ollama[31772]: {"function":"server_params_parse","level":"INFO","line":2606,"msg":"logging to file is disabled.","tid":"140550867894272","timestamp":1714837923}
May  4 15:52:03 prettygirl ollama[31772]: {"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140550867894272","timestamp":1714837923}
May  4 15:52:03 prettygirl ollama[31772]: {"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":6,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140550867894272","timestamp":1714837923,"total_threads":12}
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   0:                       general.architecture str              = llama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   1:                               general.name str              = codellama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type  f32:   65 tensors
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type q4_0:  225 tensors
May  4 15:52:03 prettygirl ollama[31666]: llama_model_loader: - type q6_K:    1 tensors
May  4 15:52:03 prettygirl ollama[31666]: llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: format           = GGUF V2
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: arch             = llama
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: vocab type       = SPM
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_vocab          = 32016
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_merges         = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ctx_train      = 16384
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd           = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head           = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_head_kv        = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_layer          = 32
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_rot            = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_k    = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_head_v    = 128
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_gqa            = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_k_gqa     = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_embd_v_gqa     = 4096
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_ff             = 11008
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert         = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_expert_used    = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: causal attn      = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: pooling type     = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope type        = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope scaling     = linear
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_base_train  = 1000000.0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: freq_scale_train = 1
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: n_yarn_orig_ctx  = 16384
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: rope_finetuned   = unknown
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_conv       = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_inner      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_d_state      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: ssm_dt_rank      = 0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model type       = 7B
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model ftype      = Q4_0
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model params     = 6.74 B
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: general.name     = codellama
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: BOS token        = 1 '<s>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOS token        = 2 '</s>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: UNK token        = 0 '<unk>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: LF token         = 13 '<0x0A>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: MID token        = 32009 '▁<MID>'
May  4 15:52:03 prettygirl ollama[31666]: llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
May  4 15:52:03 prettygirl ollama[31666]: ggml_cuda_init: found 1 CUDA devices:
May  4 15:52:03 prettygirl ollama[31666]:   Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: ggml ctx size =    0.30 MiB
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading 32 repeating layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloading non-repeating layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors: offloaded 33/33 layers to GPU
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors:        CPU buffer size =    70.35 MiB
May  4 15:52:03 prettygirl ollama[31666]: llm_load_tensors:      CUDA0 buffer size =  3577.61 MiB
May  4 15:52:04 prettygirl ollama[31666]: ..................................................................................................
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ctx      = 2048
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_batch    = 512
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: n_ubatch   = 512
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_base  = 1000000.0
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: freq_scale = 1
May  4 15:52:04 prettygirl ollama[31666]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph nodes  = 1030
May  4 15:52:04 prettygirl ollama[31666]: llama_new_context_with_model: graph splits = 2

@JieChenSimon

Same issue occurred for me after upgrading.
[screenshot]

@dhiltgen
Collaborator

dhiltgen commented May 5, 2024

@thevisad and @JieChenSimon from what I can tell, the system is behaving as expected in your examples. We now try NOT to spread a single model over multiple GPUs, as that actually makes things run slower, not faster, when the model fits within one GPU. We only spread a model across multiple GPUs if it won't fit in a single GPU. If that's not the behavior you're seeing, can you clarify?
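(One quick way to check where the layers actually landed is to watch per-GPU memory while a request is running — plain nvidia-smi, nothing Ollama-specific:)

watch -n 1 nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv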

@wlsoft2006

[screenshot]

@wlsoft2006

wlsoft2006 commented May 6, 2024

Only one GPU is in use after updating to 1.33.
Linux ai-centos7 3.10.0-1160.114.2.el7.x86_64 #1 SMP Wed Mar 20 15:54:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.9.2009 (Core)

@wlsoft2006

wlsoft2006 commented May 6, 2024

[screenshot]
When I load both models at the same time, it works!
That's all I need, no problem!

@nyoma-diamond

nyoma-diamond commented May 8, 2024

EDIT: Turned out to be user error. My system's administrator for some reason set the CUDA_VISIBLE_DEVICES environment variable for each user so they could only access one specific GPU (I happened to be set to GPU 1). I thought I had CUDA_VISIBLE_DEVICES unset, but when I checked again in a fresh bash session it was set to the device ID for GPU 1. Unsetting the variable or adding the IDs of the other GPUs resolved this.

I'm also running into this problem. The system I am using has 4x Nvidia P100s, but Ollama only sees one at any given moment (from what I can tell, always GPU 1, not 0, 2, or 3). However, I'm observing this behavior on both v0.1.32 and v0.1.34.

time=2024-05-08T14:44:14.522+01:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-08T14:44:14.575+01:00 level=INFO source=gpu.go:127 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15

Output of nvidia-smi (abbreviated):

Wed May  8 14:49:06 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P100-PCIE-16GB           Off |   00000000:25:00.0 Off |                    0 |
| N/A   32C    P0             27W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off |   00000000:5B:00.0 Off |                    0 |
| N/A   50C    P0             42W /  250W |    5254MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla P100-PCIE-16GB           Off |   00000000:9B:00.0 Off |                    0 |
| N/A   33C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla P100-PCIE-16GB           Off |   00000000:C8:00.0 Off |                    0 |
| N/A   33C    P0             33W /  250W |     288MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

As a result, large models get partially loaded onto one GPU and any excess is offloaded to the CPU instead of using the remaining three GPUs. In the logs, Ollama says it only detects the one GPU. This occurs on both v0.1.32 and v0.1.34, with or without OLLAMA_DEBUG enabled.

It may be worth noting that the GPU that Ollama detects is always GPU 1 (as listed in nvidia-smi). Since this system is shared across multiple users, this also causes problems when someone is already using the selected GPU, causing Ollama to offload the entire model to the CPU, rather than using any of the other completely free GPUs.
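For anyone who hits the same symptom, a quick sanity check before digging into Ollama itself (plain shell; assumes you launch ollama serve from the same session):

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"   # should be unset, or list every GPU you expect
nvidia-smi -L                                                  # lists all physical GPUs regardless of the variable
unset CUDA_VISIBLE_DEVICES                                     # then restart ollama serve from this shell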

@dhiltgen
Collaborator

dhiltgen commented May 8, 2024

I'm working on a change that will expose this setting in the logs during startup so it's easier to spot misconfigurations.

What I also noticed is that we have a regression in 0.1.34 where CUDA_VISIBLE_DEVICES is no longer filtering out GPUs, since we switched from the CUDA runtime library to the NVIDIA driver library in the latest release. I'll look at adding a fix for that in the PR as well.

Update: my test was incorrect; CUDA_VISIBLE_DEVICES is still working properly.

@ToRvaLDz

ToRvaLDz commented May 20, 2024

I have the same problem in Docker: I have 13 GPUs, but it only finds 1:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
NVIDIA_VISIBLE_DEVICES=all
HOSTNAME=502558fb132a
PWD=/
NVIDIA_DRIVER_CAPABILITIES=compute,utility
OLLAMA_MAX_LOADED_MODELS=3
CUDA_VISIBLE_DEVICES=12
OLLAMA_HOST=0.0.0.0
TERM=xterm
SHLVL=1
OLLAMA_NUM_PARALLEL=12
OLLAMA_DEBUG=0
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
OLLAMA_KEEP_ALIVE=24h
_=/usr/bin/env
time=2024-05-20T11:50:34.699Z level=INFO source=images.go:704 msg="total blobs: 5"
time=2024-05-20T11:50:34.700Z level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T11:50:34.701Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
time=2024-05-20T11:50:34.701Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1665127007/runners
time=2024-05-20T11:50:38.352Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T11:50:44.849Z level=INFO source=types.go:71 msg="inference compute" id=GPU-4faa64e1-cd46-b533-d16d-c39809fde7ac library=cuda compute=7.5 driver=12.4 name="NVIDIA GeForce GTX 1660 SUPER" total="5.8 GiB" available="5.7 GiB"
[GIN] 2024/05/20 - 11:51:13 | 200 |     572.484µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |     346.879µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |     274.741µs |      172.18.0.1 | GET      "/api/tags"
[GIN] 2024/05/20 - 11:51:13 | 200 |      29.468µs |      172.18.0.1 | GET      "/api/version"
[GIN] 2024/05/20 - 11:51:20 | 200 |      90.274µs |      172.18.0.1 | GET      "/api/version"
time=2024-05-20T11:51:25.103Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.104Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.105Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=16 memory.available="5.7 GiB" memory.required.full="9.2 GiB" memory.required.partial="5.6 GiB" memory.required.kv="3.0 GiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.6 GiB" memory.graph.partial="1.7 GiB"
time=2024-05-20T11:51:25.105Z level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama1665127007/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 --ctx-size 24576 --batch-size 512 --embedding --log-disable --n-gpu-layers 16 --parallel 12 --port 45825"
time=2024-05-20T11:51:25.105Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T11:51:25.105Z level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T11:51:25.106Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="134968727871488" timestamp=1716205885
INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134968727871488" timestamp=1716205885 total_threads=4
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="14" port="45825" tid="134968727871488" timestamp=1716205885
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-05-20T11:51:25.357Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llm_load_tensors:      CUDA0 buffer size =  1872.50 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 24576
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1536.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1536.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     6.06 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1705.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    56.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 180

Inside the docker container:

Mon May 20 11:54:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off |   00000000:01:00.0 Off |                  N/A |
| 40%   46C    P0             26W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1660 ...    Off |   00000000:04:00.0 Off |                  N/A |
| 37%   43C    P0             30W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1660 ...    Off |   00000000:06:00.0 Off |                  N/A |
| 39%   43C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce GTX 1660 ...    Off |   00000000:08:00.0 Off |                  N/A |
| 42%   47C    P0             28W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce GTX 1660 ...    Off |   00000000:09:00.0 Off |                  N/A |
| 41%   44C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0A:00.0 Off |                  N/A |
| 36%   41C    P0             33W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0B:00.0 Off |                  N/A |
| 45%   43C    P0             30W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0C:00.0 Off |                  N/A |
| 43%   45C    P0             31W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   8  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0D:00.0 Off |                  N/A |
| 43%   44C    P0             30W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   9  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0E:00.0 Off |                  N/A |
| 22%   44C    P0             28W /  125W |       1MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  10  NVIDIA GeForce GTX 1660 ...    Off |   00000000:0F:00.0 Off |                  N/A |
| 29%   44C    P0             31W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  11  NVIDIA GeForce GTX 1660 ...    Off |   00000000:10:00.0 Off |                  N/A |
| 20%   43C    P0             32W /  125W |       1MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|  12  NVIDIA GeForce GTX 1660 ...    Off |   00000000:11:00.0 Off |                  N/A |
| 45%   47C    P2             30W /  125W |    5441MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

@dhiltgen
Collaborator

@ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12
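For a Docker setup like the one above, a minimal sketch of exposing every GPU to the container (assumes the official ollama/ollama image and the NVIDIA Container Toolkit; the key part is dropping the CUDA_VISIBLE_DEVICES=12 override and using --gpus=all):

docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  -e OLLAMA_NUM_PARALLEL=12 \
  -e OLLAMA_MAX_LOADED_MODELS=3 \
  --name ollama ollama/ollama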

@ToRvaLDz

@ToRvaLDz CUDA_VISIBLE_DEVICES=12 will only expose one of your GPUs to ollama. If you remove that environment variable, then it should see all the devices. Alternatively, you could set it to CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12

I'm sorry, you are right. Thank you.

@dhiltgen
Collaborator

I'm going to mark this one closed now, as the visible devices env var seems to be working properly. I am working on some improvements in concurrency memory predictions that help when operating at near-max VRAM allocation, which should land in an upcoming release.
