Improving the efficiency of using multiple GPU cards. #4198

Open
zhqfdn opened this issue May 6, 2024 · 5 comments · May be fixed by #4266
Labels: feature request (New feature or request)

Comments

zhqfdn commented May 6, 2024

Before v0.1.32, a loaded model was distributed evenly across all GPU cards, which improved overall GPU utilization. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.

When multiple users run requests at the same time, responses are slower. Spreading the model evenly across multiple GPU cards would raise GPU utilization and improve overall efficiency.

Tesla T4 GPU list

```
localhost.localdomain Mon May 6 18:41:30 2024 550.54.15
[0] Tesla T4 | 54°C, 93 % | 12238 / 15360 MB | ollama(12236M)
[1] Tesla T4 | 36°C,  0 % |     2 / 15360 MB |
[2] Tesla T4 | 30°C,  0 % |     2 / 15360 MB |
[3] Tesla T4 | 33°C,  0 % |     2 / 15360 MB |
```

ollama.service

```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS='*'"
Environment="OLLAMA_MODELS=/ollama/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="PATH=/root/.local/bin:/root/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
```
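
For anyone reproducing this setup, a minimal sketch of applying the unit changes and confirming which GPUs the driver exposes to the service (this assumes the unit file lives at /etc/systemd/system/ollama.service, as with the default Linux install):

```sh
# Reload systemd after editing /etc/systemd/system/ollama.service, then restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Check that the service is up and list the GPUs the driver reports
systemctl status ollama --no-pager
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```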

zhqfdn added the feature request label on May 6, 2024
zhqfdn (Author) commented May 6, 2024

Environment="OLLAMA_NUM_PARALLEL=5"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

Three people simultaneously used and loaded llama 3, which was very fast.
The fourth person used to load codememma, which was very slow.

```
localhost.localdomain Mon May 6 19:12:05 2024 550.54.15
[0] Tesla T4 | 57°C, 94 % | 13790 / 15360 MB | ollama(13788M)  ---> llama3
[1] Tesla T4 | 52°C,  9 % | 13828 / 15360 MB | ollama(13806M)  ---> codegemma
[2] Tesla T4 | 31°C,  0 % |     2 / 15360 MB |
[3] Tesla T4 | 34°C,  0 % |     2 / 15360 MB |
```
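
One way to reproduce this observation is to watch per-card utilization and memory while the clients are generating; a plain nvidia-smi loop is enough (the gpustat-style listing above shows the same fields):

```sh
# Refresh every 2 seconds while several clients are generating
nvidia-smi \
  --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total \
  --format=csv -l 2
```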

kungfu-eric commented:
Observing this too (#4212). It's a severe performance regression.

gaborkukucska commented:
Additionally, it would be awesome if it could also load across networked GPUs, the way Petals does.

This would allow communities with older GPUs to combine their VRAM, especially where multiple computers share a local network, e.g. in schools.

kungfu-eric commented May 7, 2024

Confirmed that downgrading to 0.1.31 resolves this issue for me, as per @zhqfdn's suggestion.

It must be the new GPU detection added after 0.1.31. Should it revert to the cudart GPU detection? Here's the comparison in the logs:

0.1.31

```
time=2024-05-07T06:54:10.510-07:00 level=INFO source=images.go:804 msg="total blobs: 62"
time=2024-05-07T06:54:10.511-07:00 level=INFO source=images.go:811 msg="total unused blobs removed: 0"
time=2024-05-07T06:54:10.512-07:00 level=INFO source=routes.go:1118 msg="Listening on [::]:7200 (version 0.1.31)"
time=2024-05-07T06:54:10.521-07:00 level=INFO source=payload_common.go:113 msg="Extracting dynamic libraries to /tmp/ollama2237025292/runners ..."
time=2024-05-07T06:54:13.192-07:00 level=INFO source=payload_common.go:140 msg="Dynamic LLM libraries [cuda_v11 rocm_v60000 cpu_avx cpu_avx2 cpu]"
time=2024-05-07T06:54:13.192-07:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-05-07T06:54:13.192-07:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
time=2024-05-07T06:54:13.193-07:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama2237025292/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.11.7.60]"
time=2024-05-07T06:54:14.328-07:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-05-07T06:54:14.328-07:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-05-07T06:54:14.969-07:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
```

0.1.34rc1

```
2024/05/07 06:38:53 routes.go:989: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-07T06:38:53.154-07:00 level=INFO source=images.go:897 msg="total blobs: 62"
time=2024-05-07T06:38:53.154-07:00 level=INFO source=images.go:904 msg="total unused blobs removed: 0"
time=2024-05-07T06:38:53.155-07:00 level=INFO source=routes.go:1034 msg="Listening on 127.0.0.1:11434 (version 0.1.34-rc1)"
time=2024-05-07T06:38:53.163-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1226851952/runners
time=2024-05-07T06:38:55.816-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11 rocm_v60002 cpu cpu_avx]"
time=2024-05-07T06:38:55.816-07:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-07T06:38:56.036-07:00 level=INFO source=gpu.go:127 msg="detected GPUs" count=3 library=/usr/lib/x86_64-linux-gnu/libcuda.so.515.43.04
time=2024-05-07T06:38:56.036-07:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
```

I tested the 0.1.33 GA release and it had the same issue, as per #4212.
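
For anyone else bisecting this, a rough sketch of pinning the older build and capturing more detailed detection logs; the OLLAMA_VERSION override for the install script is an assumption here, and OLLAMA_DEBUG is the flag visible in the server config dump above:

```sh
# Pin an older release (assumes the official install script honors OLLAMA_VERSION)
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.31 sh

# Re-run the newer build with verbose logging to compare the GPU discovery path
OLLAMA_DEBUG=1 ollama serve
```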

dhiltgen self-assigned this on May 8, 2024
dhiltgen (Collaborator) commented May 8, 2024

> Before v0.1.32, a loaded model was distributed evenly across all GPU cards, which improved overall GPU utilization. In v0.1.32 and v0.1.33, loading a model automatically uses only one card.

@zhqfdn this was an intentional design change. Based on our performance testing, when a model fits on a single GPU we saw better performance loading it onto one card than unnecessarily splitting it across several.

I'll put up a PR to support tuning this behavior, but at least in my initial testing it doesn't yield a performance benefit. If you can test it and validate that you see a performance benefit, that would help justify merging it.
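
If anyone wants to contribute numbers for that validation, a rough sketch of measuring decode throughput through the generate API follows; the model name and prompt are placeholders, and jq is assumed to be installed (eval_duration is reported in nanoseconds):

```sh
# Non-streaming request; eval_count / eval_duration give decode throughput
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Write a 200-word summary of systemd.", "stream": false}' \
  | jq '{eval_count, eval_duration, tokens_per_s: (.eval_count / (.eval_duration / 1e9))}'
```

Running the same request from several terminals at once approximates the multi-user load described at the top of the issue.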

dhiltgen linked a pull request on May 8, 2024 that will close this issue