Add special tokens to server llama_decode() inputs
The llamafile server's /embedding endpoint was returning embeddings that
were very inconsistent with llama.cpp. This was due to upstream changes
in tokenization: the upstream project now adds special tokens, e.g.
["[CLS]", " apples", " are", " red", " .", "[SEP]"], before running the
operation. We now handle things more similarly to upstream, although our
llama.cpp server code has diverged so much since removing LLaVA support
that they're very different pieces of software at this point.

Fixes #391
jart committed May 4, 2024
1 parent 42bd9b8 commit 7900294
Showing 3 changed files with 23 additions and 10 deletions.
17 changes: 13 additions & 4 deletions llama.cpp/main/main.1
@@ -559,6 +559,19 @@ Print token count every
 tokens.
 .Pp
 Default: -1
+.It Fl Fl pooling Ar KIND
+Specifies pooling type for embeddings. This may be one of:
+.Pp
+.Bl -dash -compact
+.It
+none
+.It
+mean
+.It
+cls
+.El
+.Pp
+The model default is used if unspecified.
 .El
 .Sh CLI OPTIONS
 The following options may be specified when
@@ -741,10 +754,6 @@ Path from which to serve static files.
 .Pp
 Default:
 .Pa /zip/llama.cpp/server/public
-.It Fl Fl embedding
-Enable embedding vector output.
-.Pp
-Default: disabled
 .It Fl Fl nobrowser
 Do not attempt to open a web browser tab at startup.
 .It Fl gan Ar N , Fl Fl grp-attn-n Ar N
14 changes: 9 additions & 5 deletions llama.cpp/main/main.1.asc
@@ -537,6 +537,15 @@
 
 Default: ‐1
 
+--pooling KIND
+Specifies pooling type for embeddings. This may be one of:
+
+- none
+- mean
+- cls
+
+The model default is used if unspecified.
+
 CLI OPTIONS
 The following options may be specified when llamafile is running in
 --cli mode.
@@ -737,11 +746,6 @@
 Default: /zip/llama.cpp/server/public
---embedding
-Enable embedding vector output.
-Default: disabled
 --nobrowser
 Do not attempt to open a web browser tab at startup.
2 changes: 1 addition & 1 deletion llama.cpp/server/server.cpp
@@ -1783,7 +1783,7 @@ struct llama_server_context
             }
             else
             {
-                prompt_tokens = tokenize(slot.prompt, system_prompt.empty() && add_bos_token); // add BOS if there isn't system prompt
+                prompt_tokens = tokenize(slot.prompt, system_prompt.empty()); // add BOS if there isn't system prompt
             }
 
             slot.num_prompt_tokens = prompt_tokens.size();
