update HIP_UMA #7399 #7414

Merged: 2 commits into ggerganov:master from feature/hip_uma on May 27, 2024
Conversation

@Djip007 (Contributor) commented May 20, 2024

Add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled.
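
A minimal sketch of the pattern (an illustration, not the exact patch; the helper name is made up, the HIP calls are the real API):

static hipError_t alloc_uma_coarse(void ** ptr, size_t size, int device) {
    // allocate unified memory so the iGPU can use system RAM directly
    hipError_t err = hipMallocManaged(ptr, size);
    if (err == hipSuccess) {
        // the fine-grained default is slow for iGPU compute; advise coarse-grained
        err = hipMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
    }
    return err;
}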

On my Ryzen 7940HS I get a noticeable speed-up:

build with:

# gfx1103 is not supported; use gfx1101 in its place:
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1101
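
For a CMake build, the equivalent would be something like this sketch (using the -DLLAMA_HIP_UMA=ON option mentioned later in this thread; the exact option set is an assumption):

cmake -B build -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DAMDGPU_TARGETS=gfx1101
cmake --build build -j16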

run bench with:

HSA_OVERRIDE_GFX_VERSION=11.0.1 ./main -m ~/LLM/mistral-7b-instruct-v0.2.Q8_0.gguf -ngl 999 --temp 0 -c 2048 -p "[INST] ... [/INST]"

I get:

# before PR:
llama_print_timings:        load time =    3386,19 ms
llama_print_timings:      sample time =      12,14 ms /   609 runs   (    0,02 ms per token, 50177,14 tokens per second)
llama_print_timings: prompt eval time =   15051,81 ms /  1466 tokens (   10,27 ms per token,    97,40 tokens per second)
llama_print_timings:        eval time =  100420,23 ms /   608 runs   (  165,16 ms per token,     6,05 tokens per second)
llama_print_timings:       total time =  115547,81 ms /  2074 tokens

# after PR:
llama_print_timings:        load time =    2606,81 ms
llama_print_timings:      sample time =      12,19 ms /   609 runs   (    0,02 ms per token, 49946,69 tokens per second)
llama_print_timings: prompt eval time =    8120,08 ms /  1466 tokens (    5,54 ms per token,   180,54 tokens per second)
llama_print_timings:        eval time =   65652,40 ms /   608 runs   (  107,98 ms per token,     9,26 tokens per second)
llama_print_timings:       total time =   73841,90 ms /  2074 tokens

add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled.
- get 2x on prompt eval and 1.5x on token gen with ROCm 6.0 on Ryzen 7940HS iGPU (780M/gfx1103)
github-actions bot commented May 20, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 553 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8425.21ms p(95)=20110.68ms fails=, finish reason: stop=500 truncated=53
  • Prompt processing (pp): avg=94.29tk/s p(95)=389.58tk/s
  • Token generation (tg): avg=47.34tk/s p(95)=48.07tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feature/hip_uma commit=a6a1abd98ebd8eea215e47c1b6547f1404ad9b7a

[Benchmark charts (time series over the 10m run, 553 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@mofosyne added labels on May 20, 2024: enhancement (New feature or request), review complexity : medium (Generally require more time to grok but manageable by beginner to medium expertise level), performance (Speed related topics)
@jammm (Contributor) commented May 20, 2024

I would definitely benchmark this on Windows and Linux separately. I think hipMallocManaged may perform slower on Windows, though my knowledge could be outdated.

@Djip007 (Contributor, Author) commented May 20, 2024

Is "HIP_UMA" possible on windows?
I do not have windows to bench it sorry. But if it is possible I like to see what we can get (or not get...)

https://hipsolver.readthedocs.io/en/rocm-6.1.1/conceptual/gpu-memory.html#coherence
AMD suggest that if it can be use it must be use.

github-actions bot added labels on May 20, 2024: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
@Djip007 (Contributor, Author) commented May 20, 2024

> I would definitely benchmark this on Windows and Linux separately. I think hipMallocManaged may perform slower on Windows, though my knowledge could be outdated.

First, hipMallocManaged is used only with -DLLAMA_HIP_UMA=ON, and it is slower on dGPUs in all cases.
On Linux I ran some benchmarks on the iGPU with a simple rocBLAS hgemm.
With hipMallocManaged and the default fine-grained coherence it is slower than hipMalloc, but with coarse-grained it has the same speed. So on Linux this makes it possible to use all of the RAM (up to a limit defined by a kernel boot option) without reserving a large chunk of VRAM in the BIOS (when that is even possible...), and without a performance penalty.
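
As an illustration, a micro-benchmark in that spirit could look like the sketch below (a stand-in: it times a trivial kernel rather than the rocBLAS hgemm actually used, and the sizes and iteration count are arbitrary):

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float * x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(int argc, char ** argv) {
    const size_t n = 1u << 26;  // 64M floats = 256 MiB
    float * x = nullptr;
    hipMallocManaged((void **)&x, n * sizeof(float));
    if (argc > 1) {
        // pass any argument to switch from the fine-grained default
        // to coarse-grained coherence on device 0
        hipMemAdvise(x, n * sizeof(float), hipMemAdviseSetCoarseGrain, 0);
    }

    hipEvent_t beg, end;
    hipEventCreate(&beg);
    hipEventCreate(&end);
    hipEventRecord(beg, 0);
    for (int it = 0; it < 100; ++it) {
        scale<<<(n + 255) / 256, 256>>>(x, n);
    }
    hipEventRecord(end, 0);
    hipEventSynchronize(end);
    float ms = 0.0f;
    hipEventElapsedTime(&ms, beg, end);
    printf("100 kernel launches: %.2f ms\n", ms);
    hipFree(x);
    return 0;
}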

Note: I would be happy if someone who can reserve a large VRAM carve-out on a Ryzen 7940HS could bench both ... 🤞

@slaren added the need feedback label (Testing and feedback with results are needed) on May 23, 2024
@slaren (Collaborator) commented May 26, 2024

I have made some minor changes to the code to simplify it a bit and make it more consistent; if everything looks good, let's merge this.

@Djip007 (Contributor, Author) commented May 27, 2024

> I have made some minor changes to the code to simplify it a bit and make it more consistent; if everything looks good, let's merge this.

I only just had time to read it, but it looks good to me.

@Djip007 (Contributor, Author) commented May 27, 2024

Or like this...

static inline cudaError_t ggml_cuda_device_malloc(void ** ptr, size_t size, int device) {
    auto res = cudaMalloc(ptr, size);
#if defined(GGML_USE_HIPBLAS) && defined(GGML_HIP_UMA)
    // not enough free VRAM: fall back to a UMA (managed memory) allocation
    if (res == hipErrorOutOfMemory) {
        GGML_CUDA_LOG_INFO("  Device %d: cannot alloc %u MB on VRAM, trying HMM instead\n", device, (uint32_t)(size / 1024 / 1024));
        res = hipMallocManaged(ptr, size);
        if (res == hipSuccess) {
            // configure the memory for best speed (this is not supposed to fail)
            CUDA_CHECK(hipMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device));
        }
    }
#endif
    return res;
}
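
(The design choice in this variant: try a normal VRAM allocation first and fall back to unified memory only on hipErrorOutOfMemory, applying the coarse-grain advice to the fallback allocation, instead of always using managed memory when the flag is set.)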

@slaren merged commit 852aafb into ggerganov:master on May 27, 2024 (70 of 71 checks passed)
@Djip007 (Contributor, Author) commented May 28, 2024

👍
