Improving the efficiency of using multiple GPU cards. #4198

Open
zhqfdn opened this issue May 6, 2024 · 5 comments · May be fixed by #4266
Labels: feature request (New feature or request)

Comments

zhqfdn commented May 6, 2024

Before v0.1.32, a loaded model was distributed evenly across all GPU cards, which improved overall GPU utilization. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.

When multiple users run requests at the same time, responses are slower. Spreading the model evenly across multiple GPU cards would raise GPU utilization and improve overall efficiency.

Tesla T4 GPU list

```
localhost.localdomain Mon May 6 18:41:30 2024 550.54.15
[0] Tesla T4 | 54°C, 93 % | 12238 / 15360 MB | ollama(12236M)
[1] Tesla T4 | 36°C,  0 % |     2 / 15360 MB |
[2] Tesla T4 | 30°C,  0 % |     2 / 15360 MB |
[3] Tesla T4 | 33°C,  0 % |     2 / 15360 MB |
```

ollama.service

```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS='*'"
Environment="OLLAMA_MODELS=/ollama/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="PATH=/root/.local/bin:/root/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
```
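
For anyone reproducing this setup, a minimal sketch of applying the unit changes and confirming which GPUs the driver exposes to the service (this assumes the unit file lives at /etc/systemd/system/ollama.service, as with the default Linux install):

```sh
# Reload systemd after editing /etc/systemd/system/ollama.service, then restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Check that the service is up and list the GPUs the driver reports
systemctl status ollama --no-pager
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```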

zhqfdn added the feature request label on May 6, 2024
zhqfdn (Author) commented May 6, 2024

Environment="OLLAMA_NUM_PARALLEL=5"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

Three people simultaneously used and loaded llama 3, which was very fast.
The fourth person used to load codememma, which was very slow.

```
localhost.localdomain Mon May 6 19:12:05 2024 550.54.15
[0] Tesla T4 | 57°C, 94 % | 13790 / 15360 MB | ollama(13788M)  ---> llama3
[1] Tesla T4 | 52°C,  9 % | 13828 / 15360 MB | ollama(13806M)  ---> codegemma
[2] Tesla T4 | 31°C,  0 % |     2 / 15360 MB |
[3] Tesla T4 | 34°C,  0 % |     2 / 15360 MB |
```
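
One way to reproduce this observation is to watch per-card utilization and memory while the clients are generating; a plain nvidia-smi loop is enough (the gpustat-style listing above shows the same fields):

```sh
# Refresh every 2 seconds while several clients are generating
nvidia-smi \
  --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv -l 2
```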

kungfu-eric commented:
Observing this too (#4212). It's a severe performance regression.

gaborkukucska commented:
Additionally, it would be awesome if it could also load across networked GPUs, the way Petals does.

This would allow communities with older GPUs to combine their VRAM, especially where multiple computers share a local network, e.g. in schools.

kungfu-eric commented May 7, 2024

Confirmed that downgrading to 0.1.31 resolves this issue for me, as per @zhqfdn's suggestion.

It must be the new GPU detection added after 0.1.31. Should it revert to the cudart GPU detection? Here's the comparison in the logs:

0.1.31

```
time=2024-05-07T06:54:10.510-07:00 level=INFO source=images.go:804 msg="total blobs: 62"
time=2024-05-07T06:54:10.511-07:00 level=INFO source=images.go:811 msg="total unused blobs removed: 0"
time=2024-05-07T06:54:10.512-07:00 level=INFO source=routes.go:1118 msg="Listening on [::]:7200 (version 0.1.31)"
time=2024-05-07T06:54:10.521-07:00 level=INFO source=payload_common.go:113 msg="Extracting dynamic libraries to /tmp/ollama2237025292/runners ..."
time=2024-05-07T06:54:13.192-07:00 level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [cuda_v11 rocm_v60000 cpu_avx cpu_avx2 cpu]"
time=2024-05-07T06:54:13.192-07:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-05-07T06:54:13.192-07:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
time=2024-05-07T06:54:13.193-07:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama2237025292/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.11.7.60]"
time=2024-05-07T06:54:14.328-07:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-05-07T06:54:14.328-07:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-07T06:54:14.969-07:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
```

0.1.34rc1

```
2024/05/07 06:38:53 routes.go:989: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-07T06:38:53.154-07:00 level=INFO source=images.go:897 msg="total blobs: 62"
time=2024-05-07T06:38:53.154-07:00 level=INFO source=images.go:904 msg="total unused blobs removed: 0"
time=2024-05-07T06:38:53.155-07:00 level=INFO source=routes.go:1034 msg="Listening on 127.0.0.1:11434 (version 0.1.34-rc1)"
time=2024-05-07T06:38:53.163-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1226851952/runners
time=2024-05-07T06:38:55.816-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60002 cpu cpu_avx]"
time=2024-05-07T06:38:55.816-07:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-07T06:38:56.036-07:00 level=INFO source=gpu.go:127 msg="detected GPUs" count=3 library=/usr/lib/x86_64-linux-gnu/libcuda.so.515.43.04
time=2024-05-07T06:38:56.036-07:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
```

I tested the 0.1.33 GA release and it had the same issue, as per #4212.
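
For anyone else bisecting this, a rough sketch of pinning the older build and capturing more detailed detection logs; the OLLAMA_VERSION override for the install script is an assumption here, and OLLAMA_DEBUG is the flag visible in the server config dump above:

```sh
# Pin an older release (assumes the official install script honors OLLAMA_VERSION)
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.31 sh

# Re-run the newer build with verbose logging to compare the GPU discovery path
OLLAMA_DEBUG=1 ollama serve
```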

dhiltgen self-assigned this on May 8, 2024
dhiltgen (Collaborator) commented May 8, 2024

> Before v0.1.32, a loaded model was distributed evenly across all GPU cards, which improved overall GPU utilization. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.

@zhqfdn this was an intentional design change. Based on our performance testing, when a model fits on a single GPU we saw better performance loading it onto one card than unnecessarily splitting it across several.

I'll put up a PR to support tuning this behavior, but at least in my initial testing it doesn't yield a performance benefit. If you can test it and validate that you see a performance benefit, that would help justify merging it.
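
If anyone wants to contribute numbers for that validation, a rough sketch of measuring decode throughput through the generate API follows; the model name and prompt are placeholders, and jq is assumed to be installed (eval_duration is reported in nanoseconds):

```sh
# Non-streaming request; eval_count / eval_duration give decode throughput
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Write a 200-word summary of systemd.", "stream": false}' \
  | jq '{eval_count, eval_duration, tokens_per_s: (.eval_count / (.eval_duration / 1e9))}'
```

Running the same request from several terminals at once approximates the multi-user load described at the top of the issue.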

dhiltgen linked a pull request on May 8, 2024 that will close this issue