
How to run inference with the converted GGUF using llama-cpp? #484

Open
mk0223 opened this issue May 17, 2024 · 1 comment

Comments

@mk0223

mk0223 commented May 17, 2024

I would appreciate it if anyone could help with the following problem when using the converted GGUF for inference.

I found that running inference with llama-cpp produces different results from running inference with the saved LoRA adapters, even though I am using a Q4-quantized model in both cases.

For inference with the LoRA adapters, I kept the alpaca_prompt format:

if False: # set to True to (re)load the saved LoRA adapters; False if the model is still in memory from training
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        instruct, # instruction
        description, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 8000, use_cache = True, temperature = 0)
tokenizer.batch_decode(outputs)

For inference with llama-cpp, I used its chat-completion API, since I didn't find a way to retain the alpaca_prompt format:

from llama_cpp import Llama

llm = Llama(
      model_path=SAVED_PATH,
      n_gpu_layers=-1, # offload all layers to the GPU
      seed=1, # fix the random seed for reproducibility
      n_ctx=2048, # context window size
      # tokenizer=LlamaHFTokenizer.from_pretrained(SAVED_PATH) # is this necessary???
)
...
output = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": instruct},
          {"role": "user","content": description}
      ],
      temperature=0,
      max_tokens=8000
)
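For reference, the closest alternative I could think of for keeping the alpaca_prompt format with llama-cpp is the plain completion API. A rough sketch (not what I actually ran; it assumes alpaca_prompt, instruct, and description are the same variables as in the LoRA snippet above):

# sketch: raw completion with the same alpaca_prompt template instead of chat messages
prompt = alpaca_prompt.format(
    instruct, # instruction
    description, # input
    "", # output - leave this blank for generation!
)

output = llm.create_completion(
    prompt,
    temperature=0,
    max_tokens=8000,
)
print(output["choices"][0]["text"])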

Is it necessary to retain the alpaca_prompt format, or to convert the tokenizer from Unsloth to llama-cpp?

In the llama-cpp-python README (https://github.com/abetlen/llama-cpp-python), it is mentioned that:
"Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is required to provide HF Tokenizer for functionary. The LlamaHFTokenizer class can be initialized and passed into the Llama class. This will override the default llama.cpp tokenizer used in Llama class. The tokenizer files are already included in the respective HF repositories hosting the gguf files."
I don't quite understand whether such a discrepancy exists here, since the Unsloth demo notebook doesn't seem to mention it.
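If such a tokenizer mismatch does matter here, I assume passing the HF tokenizer through would look roughly like this (a sketch based on the llama-cpp-python README; TOKENIZER_DIR is a placeholder for the directory or HF repo that holds the original HF tokenizer files, not the GGUF file itself):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

llm = Llama(
    model_path=SAVED_PATH,
    n_gpu_layers=-1,
    n_ctx=2048,
    tokenizer=LlamaHFTokenizer.from_pretrained(TOKENIZER_DIR), # overrides the default llama.cpp tokenizer
)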

Thanks!

@danielhanchen
Contributor

A good idea to use llama-cpp's Python module - I'll make an example.
