
OpenELM support #7359 (Draft)

icecream95 wants to merge 3 commits into master from openelm
Conversation

icecream95

Fixes: #6868.

Thanks to @joshcarp for an initial attempt at this (#6986); it was very helpful as a source to copy-paste from and check against.

Currently much of the configuration is hard-coded into llama.cpp, so only the 270M model works at this point.

The ffn_up tensors in the converted model are actually concatenations of ffn_gate and ffn_up; perhaps the conversion script should separate them out?

The 270M model is impressively fast and works fine for generation, but "Chat" mode in ./server doesn't work well. Perhaps that's just because it hasn't been fine-tuned for chat? I'm not sure.

@icecream95 icecream95 marked this pull request as draft May 18, 2024 07:53
@icecream95 icecream95 changed the title Draft: OpenELM support OpenELM support May 18, 2024
@mofosyne mofosyne added the model (Model specific) and review complexity : high (Generally require in-depth knowledge of LLMs or GPUs) labels May 18, 2024
@icecream95 (Author)

It looks like context shift currently causes crashes, because build_k_shift uses the wrong number of heads from the .gguf.

A few other functions seem like they will be broken as well.

@github-actions github-actions bot added the python (python script changes) label May 18, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 512 iterations 🚀

Performance details
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=9168.82ms p(95)=23246.02ms fails=, finish reason: stop=444 truncated=68
  • Prompt processing (pp): avg=111.45tk/s p(95)=515.19tk/s
  • Token generation (tg): avg=31.54tk/s p(95)=46.05tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=openelm commit=60b2e1b9c529f74f5bf881b05a6247ff6f58a71c

[Benchmark charts omitted — Mermaid xychart plots titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 512 iterations":
  • prompt_tokens_seconds
  • predicted_tokens_seconds
  • kv_cache_usage_ratio
  • requests_processing]

@ggerganov (Owner) commented May 19, 2024

> The ffn_up tensors in the converted model are actually concatenations of ffn_gate and ffn_up; perhaps the conversion script should separate them out?

We already have this logic for the Refact models:

```python
elif name == f"transformer.h.{bid}.mlp.gate_up_proj.weight":
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), data_torch[:ff_dim]))
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), data_torch[ff_dim:]))
```

You can try to reuse it in a similar way for OpenELM
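For illustration, the OpenELM analogue might look something like this (untested sketch — the proj_1 tensor name and the gate-first ordering are assumptions about the HF checkpoint, and since the FFN width varies per layer in OpenELM, it is derived from the tensor shape here):

```python
# Untested sketch: OpenELM's combined FFN projection (assumed name:
# transformer.layers.{bid}.ffn.proj_1.weight) is taken to store the gate
# half followed by the up half along dim 0, so split it down the middle.
elif name == f"transformer.layers.{bid}.ffn.proj_1.weight":
    ff_dim = data_torch.shape[0] // 2  # per-layer FFN width (varies across layers)
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), data_torch[:ff_dim]))
    tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), data_torch[ff_dim:]))
```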

> The 270M model is impressively fast and works fine for generation, but "Chat" mode in ./server doesn't work well. Perhaps that's just because it hasn't been fine-tuned for chat? I'm not sure.

Have you run perplexity with this model?
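For example (the model filename here is hypothetical):

```sh
# compute perplexity on the wikitext-2 test set
./perplexity -m openelm-270m-q8_0.gguf -f wikitext-2-raw/wiki.test.raw
```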

> It looks like context shift currently causes crashes, because build_k_shift uses the wrong number of heads from the .gguf.
>
> A few other functions seem like they will be broken as well.

We'll probably need to generalize the head number to be determined per layer. Do you need some assistance with that?
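On the conversion side, one possible direction (just an untested sketch: it assumes OpenELM's num_query_heads/num_kv_heads hparams are per-layer lists, and uses gguf-py's generic add_array since the dedicated setters take scalars):

```python
# Untested sketch: write head counts as per-layer arrays instead of a single
# scalar, so functions like build_k_shift can look up the count for each layer.
arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
self.gguf_writer.add_array(gguf.Keys.Attention.HEAD_COUNT.format(arch=arch),
                           self.hparams["num_query_heads"])  # one entry per layer
self.gguf_writer.add_array(gguf.Keys.Attention.HEAD_COUNT_KV.format(arch=arch),
                           self.hparams["num_kv_heads"])     # one entry per layer
```

The loading code in llama.cpp would then need to read these keys as arrays and index them by layer wherever it currently uses a single head count.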

@icecream95 (Author)

I've been quite tired recently, so it might be a while before I'm able to come back to this.

I see that @jart's #7445 has already been merged with similar modifications to llama_model_type_name, but I think git merge will do the right thing here without needing to change that commit.

@joshcarp

@icecream95 I might jump back on this, since I'm curious where I got stuck.

Labels
model (Model specific) · python (python script changes) · review complexity : high (Generally require in-depth knowledge of LLMs or GPUs)

Development
Successfully merging this pull request may close these issues:
  • Support for OpenELM of Apple (#6868)

4 participants