
Different result between llama_tokenize and the original Python transformers tokenizer #7384

Closed
Liufeiran123 opened this issue May 19, 2024 · 6 comments

Comments

@Liufeiran123

llama_tokenize gives a different result from the original Python transformers tokenizer.

@Liufeiran123
Author

The model is qwen1.5-7b.

@JohannesGaessler
Collaborator

Can you share the results that you get for each?

@Liufeiran123
Author

When the prompt contains "\n", the results differ:
the Transformers tokenizer produces an actual newline token, while
llama_tokenize treats "\n" as the literal characters backslash and n.

@Liufeiran123
Author

The problem is that llama_tokenize does not interpret escape sequences such as "\n".
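To make the difference concrete: when the prompt arrives as a shell argument, "\n" reaches llama_tokenize as two literal characters (a backslash followed by n), whereas "\n" written in a Python string literal is already a single newline character. A minimal illustration (the "Hello\nWorld" prompt here is just an example, not taken from the issue):

```cpp
#include <cassert>
#include <string>

int main() {
    // What a shell argument like "Hello\nWorld" actually contains: backslash + 'n'.
    std::string from_cli = "Hello\\nWorld";
    // What a Python string literal "Hello\nWorld" contains: a real newline.
    std::string intended = "Hello\nWorld";

    assert(from_cli.size() == intended.size() + 1); // "\\n" is two bytes, '\n' is one
    assert(from_cli != intended);                   // so the two tokenizers see different input
    return 0;
}
```

Since the inputs themselves differ, the token sequences differ as well, regardless of which tokenizer implementation is used.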

@JohannesGaessler
Collaborator

The llama.cpp binaries such as main have a CLI argument called --escape that handles this. However, the escaping seems to be handled in common/common.cpp rather than in llama.cpp, so I don't think you can automatically escape control characters through the llama.h API, and I don't know whether there is a reason for this.
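For context, the escaping that --escape performs amounts to a small rewrite of the prompt string before it is tokenized. A minimal sketch of that idea (the helper below is written for illustration and is not the actual common/common.cpp function):

```cpp
#include <cstdio>
#include <string>

// Rewrite two-character escape sequences such as "\\n" into the corresponding
// control characters before the prompt is passed to the tokenizer.
std::string process_escapes_sketch(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[++i]) {
                case 'n':  out += '\n'; break;
                case 't':  out += '\t'; break;
                case '\\': out += '\\'; break;
                case '"':  out += '"';  break;
                default:   out += '\\'; out += in[i]; break; // keep unknown escapes verbatim
            }
        } else {
            out += in[i];
        }
    }
    return out;
}

int main() {
    std::string prompt    = "Hello\\nWorld";            // as received from the command line
    std::string unescaped = process_escapes_sketch(prompt);
    std::printf("%s\n", unescaped.c_str());             // prints "Hello" and "World" on separate lines
    return 0;
}
```

Callers of the llama.h API would presumably have to apply something like this themselves before tokenizing; the binaries only do it when --escape is passed.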

@ggerganov
Owner

It might be better to move the escape logic from common to llama and enable it by default, since the current implementation is causing confusion.
