
Llama.cpp server doesn't return grammar error messages when in streaming mode #7391

Open
richardanaya opened this issue May 19, 2024 · 4 comments

@richardanaya

When you run a streaming request against the llama.cpp server, there appears to be no way to see grammar-related error messages in the HTTP response.
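
For reference, a minimal sketch of the kind of request I mean (the localhost URL and payload fields here are assumptions for illustration; the grammar is deliberately malformed):

import requests

payload = {
    "prompt": "List three colors.",
    "grammar": 'root ::= "unterminated',  # deliberately invalid GBNF (unterminated literal)
    "stream": True,
}

with requests.post("http://localhost:8080/completion", json=payload, stream=True) as resp:
    print("HTTP status:", resp.status_code)
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # no grammar error shows up anywhere in the chunks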

@HanClinto
Collaborator

@richardanaya Sorry for my delay in seeing this!

Great catch, and thank you for the excellent writeup in the ollama repo! Fantastic reporting -- I particularly appreciated the annotated screenshots of what you're seeing. I haven't yet dug through that whole thread in detail, but I plan to read more later.

Over there, you wrote:

My conclusion from this, given the advice of the community, is that we do indeed have to do our own GBNF grammar validation on the Go server side to do our best to prevent passing down bad grammar.

I've spent a bit of time (#5948, #5950, #6004) improving the GBNF grammar validation within llama.cpp, and I would hate to see this work be re-implemented. Building a separate GBNF validator is non-ideal, since any changes to the base functionality (such as the grammar extensions in #6640 or #6467) would need to be duplicated in every other implementation. That's the path I originally started going down, but I ultimately chose the approach in #5948 so that we would have a validation program that is always in lockstep with master.

But that said, I would love to better support the server. As noted in this comment, at the time I started working on this I wasn't entirely sure how best to provide this error-checking to the CLI or other applications (such as the server).

In an ideal world, what would the behavior of llama.cpp (for the server and/or CLI) be in the case of an invalid grammar? I think I was the last one to touch the grammar validation code, and I'm happy to implement the changes you need; knowing what you would like to see would help me implement this. I think adding grammar support to ollama would be a huge quality-of-life improvement for a lot of people -- I want to do whatever I can to make this happen!

@HanClinto HanClinto self-assigned this Jun 6, 2024
@richardanaya
Author

richardanaya commented Jun 6, 2024

@HanClinto hey! Thanks for the response. I 100% agree with you -- I would love to just use llama.cpp too. The biggest issue seems to be short-circuiting execution of tasks (without ejecting the in-memory model) during "streaming=true" mode. Any kind of 400 error would do :)

If that's not easy, having some kind of /grammar validation endpoint I could POST to would be nice too.
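
To sketch the behavior I'm hoping for (hypothetical client code -- the endpoint, fields, and error shape here are assumed):

import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "hi", "grammar": "root ::= (", "stream": True},  # broken grammar
    stream=True,
)
if resp.status_code >= 400:
    # short-circuit immediately, before consuming any of the stream
    print("grammar rejected:", resp.json().get("error", {}).get("message"))
else:
    for line in resp.iter_lines():
        pass  # normal streaming path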

@HanClinto
Collaborator

The biggest issue seems to be short-circuiting execution of tasks (without ejecting the in-memory model) during "streaming=true" mode. Any kind of 400 error would do :)

Sounds pretty reasonable. It looks like 422 (Unprocessable Entity) might be a reasonable error to return in the case of an invalid grammar?

If that's not easy, having some kind of /grammar validation endpoint I could POST to would be nice too.

We could definitely add something like that as well. It might be nice to expose the gbnf-validator functionality via such an endpoint.

I haven't messed around with the streaming mode of the server very much -- do you happen to have a sample script handy that I can use for testing? If not, I'm happy to figure it out and write something up in Python or whatever, but I figured it was worth asking. If it's not self-contained or not easy to share, don't worry about it.

Out of curiosity, do you happen to know what the behavior is like when streaming is set to false -- is that more sane? If so, that might be able to function as your grammar-validation endpoint for the time being.
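
If it helps, something like this throwaway sketch is what I had in mind for testing (server URL and payload fields assumed -- adjust as needed):

import requests

BAD = {"prompt": "hi", "grammar": 'root ::= "oops'}  # unterminated literal

# Non-streaming: see what (if any) error comes back.
r = requests.post("http://localhost:8080/completion", json={**BAD, "stream": False})
print("non-streaming:", r.status_code, r.text[:200])

# Streaming: same bad grammar, checking whether any error surfaces at all.
r = requests.post("http://localhost:8080/completion", json={**BAD, "stream": True}, stream=True)
print("streaming:", r.status_code)
for line in r.iter_lines():
    if line:
        print(line.decode())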

@HanClinto
Collaborator

Hmm, reading through the server documentation, it appears that 400 is already reserved for an invalid-grammar error:

When the server receives invalid grammar via /completions endpoint:

{
    "error": {
        "code": 400,
        "message": "Failed to parse grammar",
        "type": "invalid_request_error"
    }
}

I'm reading through the server code and I'm not entirely sure why this 400 response code wouldn't be returned when streaming = true -- I still need an easier test setup, but I'd like to get to the bottom of this.
