
How to run inference with the converted GGUF using llama-cpp? #484

Open
mk0223 opened this issue May 17, 2024 · 1 comment

Comments

@mk0223

mk0223 commented May 17, 2024

I would appreciate it if anyone could help with the following problem when using the converted GGUF for inference.

I found that running inference with llama-cpp produces different results from running inference with the saved LoRA adapters, even though I am using a Q4-quantized model in both cases.

For inference with the LoRA adapters, I kept the alpaca_prompt format:

if False: # set to True to (re)load the saved LoRA adapters; False if the model is still in memory from training
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        instruct, # instruction
        description, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 8000, use_cache = True, temperature = 0)
tokenizer.batch_decode(outputs)

For inference with llama-cpp, I used its chat-completion API, since I didn't find a way to retain the alpaca_prompt format:

from llama_cpp import Llama

llm = Llama(
      model_path=SAVED_PATH,
      n_gpu_layers=-1, # offload all layers to the GPU
      seed=1, # fix the random seed for reproducibility
      n_ctx=2048, # context window size
      # tokenizer=LlamaHFTokenizer.from_pretrained(SAVED_PATH) # is this necessary???
)
...
output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": instruct},
          {"role": "user","content": description}
      ],
      temperature=0,
      max_tokens=8000
)
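For reference, the closest alternative I could think of for keeping the alpaca_prompt format with llama-cpp is the plain completion API. A rough sketch (not what I actually ran; it assumes alpaca_prompt, instruct, and description are the same variables as in the LoRA snippet above):

# sketch: raw completion with the same alpaca_prompt template instead of chat messages
prompt = alpaca_prompt.format(
    instruct, # instruction
    description, # input
    "", # output - leave this blank for generation!
)

output = llm.create_completion(
    prompt,
    temperature=0,
    max_tokens=8000,
)
print(output["choices"][0]["text"])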

Is it necessary to retain the alpaca_prompt format, or to convert the tokenizer from Unsloth to llama-cpp?

In the llama-cpp-python README (https://github.com/abetlen/llama-cpp-python), it is mentioned that:
"Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files."
I don't quite understand whether such a discrepancy exists here, since the Unsloth demo notebook doesn't seem to mention it.
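If such a tokenizer mismatch does matter here, I assume passing the HF tokenizer through would look roughly like this (a sketch based on the llama-cpp-python README; TOKENIZER_DIR is a placeholder for the directory or HF repo that holds the original HF tokenizer files, not the GGUF file itself):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

llm = Llama(
    model_path=SAVED_PATH,
    n_gpu_layers=-1,
    n_ctx=2048,
    tokenizer=LlamaHFTokenizer.from_pretrained(TOKENIZER_DIR), # overrides the default llama.cpp tokenizer
)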

Thanks!

@danielhanchen
Contributor

A good idea to use llama-cpp's Python module - I'll make an example.
