update HIP_UMA #7399 #7414

Merged: 2 commits into ggerganov:master from feature/hip_uma on May 27, 2024
Conversation

@Djip007 (Contributor) commented May 20, 2024

Add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled.
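
A minimal sketch of the pattern (an illustration, not the exact patch; the helper name is made up, the HIP calls are the real API):

static hipError_t alloc_uma_coarse(void ** ptr, size_t size, int device) {
    // allocate unified memory so the iGPU can use system RAM directly
    hipError_t err = hipMallocManaged(ptr, size);
    if (err == hipSuccess) {
        // the fine-grained default is slow for iGPU compute; advise coarse-grained
        err = hipMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
    }
    return err;
}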

On my Ryzen 7940HS I get a noticeable speed-up:

build with:

# gfx1103 is not supported; use gfx1101 in its place:
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1101
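
For a CMake build, the equivalent would be something like this sketch (using the -DLLAMA_HIP_UMA=ON option mentioned later in this thread; the exact option set is an assumption):

cmake -B build -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DAMDGPU_TARGETS=gfx1101
cmake --build build -j16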

run bench with:

HSA_OVERRIDE_GFX_VERSION=11.0.1 ./main -m ~/LLM/mistral-7b-instruct-v0.2.Q8_0.gguf -ngl 999 --temp 0 -c 2048 -p "[INST] ... [/INST]"

I get:

# before PR:
llama_print_timings:        load time =    3386,19 ms
llama_print_timings:      sample time =      12,14 ms /   609 runs   (    0,02 ms per token, 50177,14 tokens per second)
llama_print_timings: prompt eval time =   15051,81 ms /  1466 tokens (   10,27 ms per token,    97,40 tokens per second)
llama_print_timings:        eval time =  100420,23 ms /   608 runs   (  165,16 ms per token,     6,05 tokens per second)
llama_print_timings:       total time =  115547,81 ms /  2074 tokens

# after PR:
llama_print_timings:        load time =    2606,81 ms
llama_print_timings:      sample time =      12,19 ms /   609 runs   (    0,02 ms per token, 49946,69 tokens per second)
llama_print_timings: prompt eval time =    8120,08 ms /  1466 tokens (    5,54 ms per token,   180,54 tokens per second)
llama_print_timings:        eval time =   65652,40 ms /   608 runs   (  107,98 ms per token,     9,26 tokens per second)
llama_print_timings:       total time =   73841,90 ms /  2074 tokens

add use of hipMemAdviseSetCoarseGrain when LLAMA_HIP_UMA is enabled.
- get 2x on prompt eval and 1.5x on token gen with ROCm 6.0 on Ryzen 7940HS iGPU (780M/gfx1103)
github-actions bot commented May 20, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 553 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8425.21ms p(95)=20110.68ms fails=, finish reason: stop=500 truncated=53
  • Prompt processing (pp): avg=94.29tk/s p(95)=389.58tk/s
  • Token generation (tg): avg=47.34tk/s p(95)=48.07tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feature/hip_uma commit=a6a1abd98ebd8eea215e47c1b6547f1404ad9b7a

[Benchmark charts (time series over the 10m run, 553 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing]

@mofosyne added labels on May 20, 2024: enhancement (New feature or request), review complexity : medium (Generally require more time to grok but manageable by beginner to medium expertise level), performance (Speed related topics)
@jammm (Contributor) commented May 20, 2024

I would definitely benchmark this on Windows and Linux separately. I think hipMallocManaged may perform slower on Windows, though my knowledge could be outdated.

@Djip007 (Contributor, Author) commented May 20, 2024

Is "HIP_UMA" possible on windows?
I do not have windows to bench it sorry. But if it is possible I like to see what we can get (or not get...)

https://hipsolver.readthedocs.io/en/rocm-6.1.1/conceptual/gpu-memory.html#coherence
AMD suggest that if it can be use it must be use.

github-actions bot added labels on May 20, 2024: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
@Djip007 (Contributor, Author) commented May 20, 2024

> I would definitely benchmark this on Windows and Linux separately. I think hipMallocManaged may perform slower on Windows, though my knowledge could be outdated.

First, hipMallocManaged is used only with -DLLAMA_HIP_UMA=ON, and it is slower on dGPUs in all cases.
On Linux I ran some benchmarks on the iGPU with a simple rocBLAS hgemm.
With hipMallocManaged and the default fine-grained coherence it is slower than hipMalloc, but with coarse-grained it has the same speed. So on Linux this makes it possible to use all of the RAM (up to a limit defined by a kernel boot option) without reserving a large chunk of VRAM in the BIOS (when that is even possible...), and without a performance penalty.
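
As an illustration, a micro-benchmark in that spirit could look like the sketch below (a stand-in: it times a trivial kernel rather than the rocBLAS hgemm actually used, and the sizes and iteration count are arbitrary):

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float * x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(int argc, char ** argv) {
    const size_t n = 1u << 26;  // 64M floats = 256 MiB
    float * x = nullptr;
    hipMallocManaged((void **)&x, n * sizeof(float));
    if (argc > 1) {
        // pass any argument to switch from the fine-grained default
        // to coarse-grained coherence on device 0
        hipMemAdvise(x, n * sizeof(float), hipMemAdviseSetCoarseGrain, 0);
    }

    hipEvent_t beg, end;
    hipEventCreate(&beg);
    hipEventCreate(&end);
    hipEventRecord(beg, 0);
    for (int it = 0; it < 100; ++it) {
        scale<<<(n + 255) / 256, 256>>>(x, n);
    }
    hipEventRecord(end, 0);
    hipEventSynchronize(end);
    float ms = 0.0f;
    hipEventElapsedTime(&ms, beg, end);
    printf("100 kernel launches: %.2f ms\n", ms);
    hipFree(x);
    return 0;
}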

Note: I would be happy if someone who can reserve a large VRAM carve-out on a Ryzen 7940HS could bench both ... 🤞

@slaren added the need feedback label (Testing and feedback with results are needed) on May 23, 2024
@slaren (Collaborator) commented May 26, 2024

I have made some minor changes to the code to simplify it a bit and make it more consistent; if everything looks good, let's merge this.

@Djip007 (Contributor, Author) commented May 27, 2024

> I have made some minor changes to the code to simplify it a bit and make it more consistent; if everything looks good, let's merge this.

I only just had time to read it, but it looks good to me.

@Djip007 (Contributor, Author) commented May 27, 2024

Or like this...

static inline cudaError_t ggml_cuda_device_malloc(void ** ptr, size_t size, int device) {
    auto res = cudaMalloc(ptr, size);
#if defined(GGML_USE_HIPBLAS) && defined(GGML_HIP_UMA)
    // not enough free VRAM: fall back to a UMA (managed memory) allocation
    if (res == hipErrorOutOfMemory) {
        GGML_CUDA_LOG_INFO("  Device %d: cannot alloc %u MB on VRAM, trying HMM instead\n", device, (uint32_t)(size / 1024 / 1024));
        res = hipMallocManaged(ptr, size);
        if (res == hipSuccess) {
            // configure the memory for best speed (this is not supposed to fail)
            CUDA_CHECK(hipMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device));
        }
    }
#endif
    return res;
}
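
(The design choice in this variant: try a normal VRAM allocation first and fall back to unified memory only on hipErrorOutOfMemory, applying the coarse-grain advice to the fallback allocation, instead of always using managed memory when the flag is set.)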

@slaren merged commit 852aafb into ggerganov:master on May 27, 2024 (70 of 71 checks passed)
@Djip007 (Contributor, Author) commented May 28, 2024

👍
