
OpenELM support #7359 (Draft)

icecream95 wants to merge 3 commits into master from openelm
Conversation

icecream95

Fixes: #6868.

Thanks to @joshcarp for an initial attempt at this (#6986); it was very helpful as a source to copy-paste from and check against.

Currently much of the configuration is hard-coded into llama.cpp, so only the 270M model works at this point.

The ffn_up tensors in the converted model are actually concatenations of ffn_gate and ffn_up; perhaps the conversion script should separate them out?

The 270M model is impressively fast and works fine for generation, but "Chat" mode in ./server doesn't work well. Perhaps that's just because it hasn't been fine-tuned for chat? I'm not sure.

@icecream95 icecream95 marked this pull request as draft May 18, 2024 07:53
@icecream95 icecream95 changed the title Draft: OpenELM support OpenELM support May 18, 2024
@mofosyne mofosyne added the model (Model specific) and review complexity : high (Generally require in-depth knowledge of LLMs or GPUs) labels May 18, 2024
@icecream95 (Author)

It looks like context shift currently causes crashes, because build_k_shift uses the wrong number of heads from the .gguf.

A few other functions seem like they will be broken as well.

@github-actions github-actions bot added the python (python script changes) label May 18, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 512 iterations 🚀

Performance details
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=9168.82ms p(95)=23246.02ms fails=, finish reason: stop=444 truncated=68
  • Prompt processing (pp): avg=111.45tk/s p(95)=515.19tk/s
  • Token generation (tg): avg=31.54tk/s p(95)=46.05tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=openelm commit=60b2e1b9c529f74f5bf881b05a6247ff6f58a71c

[Benchmark charts omitted — Mermaid xychart plots titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 512 iterations":
  • prompt_tokens_seconds
  • predicted_tokens_seconds
  • kv_cache_usage_ratio
  • requests_processing]

@ggerganov (Owner) commented May 19, 2024

> The ffn_up tensors in the converted model are actually concatenations of ffn_gate and ffn_up; perhaps the conversion script should separate them out?

We already have this logic for the Refact models:

```python
elif name == f"transformer.h.{bid}.mlp.gate_up_proj.weight":
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), data_torch[:ff_dim]))
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), data_torch[ff_dim:]))
```

You can try to reuse it in a similar way for OpenELM
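For illustration, the OpenELM analogue might look something like this (untested sketch — the proj_1 tensor name and the gate-first ordering are assumptions about the HF checkpoint, and since the FFN width varies per layer in OpenELM, it is derived from the tensor shape here):

```python
# Untested sketch: OpenELM's combined FFN projection (assumed name:
# transformer.layers.{bid}.ffn.proj_1.weight) is taken to store the gate
# half followed by the up half along dim 0, so split it down the middle.
elif name == f"transformer.layers.{bid}.ffn.proj_1.weight":
    ff_dim = data_torch.shape[0] // 2  # per-layer FFN width (varies across layers)
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), data_torch[:ff_dim]))
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), data_torch[ff_dim:]))
```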

> The 270M model is impressively fast and works fine for generation, but "Chat" mode in ./server doesn't work well. Perhaps that's just because it hasn't been fine-tuned for chat? I'm not sure.

Have you run perplexity with this model?
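For example (the model filename here is hypothetical):

```sh
# compute perplexity on the wikitext-2 test set
./perplexity -m openelm-270m-q8_0.gguf -f wikitext-2-raw/wiki.test.raw
```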

> It looks like context shift currently causes crashes, because build_k_shift uses the wrong number of heads from the .gguf.
>
> A few other functions seem like they will be broken as well.

We'll probably need to generalize the head number to be determined per layer. Do you need some assistance with that?
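On the conversion side, one possible direction (just an untested sketch: it assumes OpenELM's num_query_heads/num_kv_heads hparams are per-layer lists, and uses gguf-py's generic add_array since the dedicated setters take scalars):

```python
# Untested sketch: write head counts as per-layer arrays instead of a single
# scalar, so functions like build_k_shift can look up the count for each layer.
arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
self.gguf_writer.add_array(gguf.Keys.Attention.HEAD_COUNT.format(arch=arch),
                           self.hparams["num_query_heads"])  # one entry per layer
self.gguf_writer.add_array(gguf.Keys.Attention.HEAD_COUNT_KV.format(arch=arch),
                           self.hparams["num_kv_heads"])     # one entry per layer
```

The loading code in llama.cpp would then need to read these keys as arrays and index them by layer wherever it currently uses a single head count.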

@icecream95 (Author)

I've been quite tired recently, so it might be a while before I'm able to come back to this.

I see that @jart's #7445 has already been merged with similar modifications to llama_model_type_name, but I think git merge will do the right thing here without needing to change that commit.

@joshcarp

@icecream95 I might jump back on this, since I'm curious where I got stuck.

Labels
model (Model specific) · python (python script changes) · review complexity : high (Generally require in-depth knowledge of LLMs or GPUs)

Development
Successfully merging this pull request may close these issues:
  • Support for OpenELM of Apple (#6868)

4 participants