
Adds streaming option to generate #1424

Closed · wants to merge 10 commits

Conversation

@rasbt (Collaborator) commented May 16, 2024

Fixes #1420
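
For context, here is a minimal sketch of the kind of streaming generation the PR title refers to: instead of returning the finished string, the function yields one decoded token at a time. All names here (generate_stream, the model and tokenizer interfaces) are illustrative and not the actual litgpt/generate/base.py code:

import torch

def generate_stream(model, tokenizer, prompt, max_new_tokens=50):
    # Illustrative sketch only: greedily sample and yield tokens as they
    # are produced, so a caller (e.g. a server) can stream them out.
    input_ids = tokenizer.encode(prompt)             # hypothetical tokenizer API
    ids = torch.tensor([input_ids])
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                # hypothetical model call
        next_id = torch.argmax(logits, dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        yield tokenizer.decode(next_id[0].tolist())  # stream token to caller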

@rasbt (Collaborator, Author) commented May 16, 2024

Is this what you were looking for, @williamFalcon?

@williamFalcon (Contributor) commented:

nice!

can you test it with this Studio?
https://lightning.ai/lightning-ai/studios/deploy-a-private-llama-3-8b-api

@rasbt (Collaborator, Author) commented May 17, 2024

It should work now, @williamFalcon.

I modeled the adjusted litgpt serve, which optionally uses the stream=True option in generate, after this example in LitServe. Since I am not super familiar with LitServe, could you double-check that this is indeed the correct usage, @aniketmaurya and @lantiga?
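
Roughly, the wiring described here looks like the following sketch, which follows the LitServe LitAPI structure shown later in this thread; generate_stream below is a placeholder standing in for a generate(..., stream=True) call, not the actual litgpt/deploy/serve.py code:

import litserve as ls

class LitGPTStreamAPI(ls.LitAPI):
    def setup(self, device):
        # Placeholder for loading the LitGPT model and tokenizer; this dummy
        # "model" just echoes the prompt back word by word.
        self.generate_stream = lambda prompt: (t + " " for t in prompt.split())

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Yield tokens as they are produced instead of the full completion.
        yield from self.generate_stream(prompt)

    def encode_response(self, output_stream):
        for token in output_stream:
            yield {"output": token}

if __name__ == "__main__":
    server = ls.LitServer(LitGPTStreamAPI(), stream=True)
    server.run(port=8000)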

@rasbt added the enhancement (New feature or request) label on May 17, 2024
@rasbt (Collaborator, Author) commented May 17, 2024

Many thanks for the feedback, @aniketmaurya! Does the current code look better?

@carmocca (Contributor) left a comment:

Why not import and use the chat.py generation implementation, which already implements streaming, instead of modifying the generate.py file?

@aniketmaurya (Contributor) left a comment:

LGTM! Added a few comments.

Also, please take the latest pull of LitServe; last week we merged a PR that affects streaming.

litgpt/generate/base.py (review comment, resolved)
litgpt/deploy/serve.py (review comment, outdated, resolved)
@rasbt (Collaborator, Author) commented May 20, 2024

@carmocca That's a good question; I'm not sure why I was so fixated on generate. Let me see (in a separate PR) if I can make this serving example work with chat; then we don't need to modify generate.

Thanks for the feedback, @aniketmaurya!

> Also, please take the latest pull of LitServe; last week we merged a PR that affects streaming.

I could try that internally for testing, but we'd probably have to wait until the next release. Could you ping me when the latest LitServe release comes out so that I can bump the version in LitGPT?

@rasbt (Collaborator, Author) commented May 20, 2024

Thanks for all your help with LitServe so far, @aniketmaurya. It's probably something stupid on my part, but with the recent LitServe version from GitHub main it's somehow still not working right.

I.e., starting the server with

litgpt serve --checkpoint_dir checkpoints/EleutherAI/pythia-1b --stream true

and sending

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"}
)

the response.content comes out as

b'{"output": " file"}{"output": ":"}{"output": " /"}{"output": "usr"}{"output": "/"}{"output": "share"}{"output": "/"}{"output": "X"}{"output": "11"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "d"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "7"}{"output": "."}{"output": "idx"}{"output": "."}{"output": "gz"}{"output": ":"}{"output": " error"}{"output": " opening"}{"output": " /"}{"output": "usr"}{"output": "/"}{"output": "share"}{"output": "/"}{"output": "X"}{"output": "11"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}{"output": "d"}{"output": "/"}{"output": "x"}{"output": "org"}{"output": "."}{"output": "conf"}{"output": "."}'

whereas --stream false works as expected via LitServe (please ignore the bad Pythia response; I just used a small model here for demoing):

b'{"output":"Fix typos in the following sentence: Exampel input.\\n\\nThis book is now available as a Kindle app.\\n\\nPlease go to the print page to view the rest of this book.\\n\\n## **Contents**\\n\\n**1** The Problem\\n\\n**2** How We Lear"}'

I think it's something in my LitServe implementation, because generate with stream=True works as intended (e.g., via litgpt generate base --checkpoint_dir checkpoints/EleutherAI/pythia-1b --stream true).

If you have some time to take another look, that'd be really appreciated.

@aniketmaurya (Contributor) commented:

Hi @rasbt, we will have to iterate through the response in streaming mode, like:

import requests

url = "http://127.0.0.1:8000/predict"

resp = requests.post(url, json={"prompt": "Fix typos in the following sentence: Exampel input"}, headers=None,
                     stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))

When we don't iterate and instead use response.content, requests will concatenate everything into a single bytestring.
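
If just the text is wanted, each streamed {"output": ...} chunk can also be decoded as it arrives. A small sketch that tolerates both one JSON object per line and several objects concatenated on one line (the exact chunk framing depends on the LitServe version):

import json
import requests

decoder = json.JSONDecoder()

def iter_outputs(text):
    # Walk through a string containing one or more {"output": ...} objects
    # back to back and yield the "output" value of each.
    idx = 0
    while idx < len(text):
        obj, idx = decoder.raw_decode(text, idx)
        yield obj["output"]

resp = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"prompt": "Fix typos in the following sentence: Exampel input"},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        for token in iter_outputs(line.decode("utf-8")):
            print(token, end="", flush=True)
print()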

@rasbt (Collaborator, Author) commented May 20, 2024

That makes sense! Thanks for your patience here, @aniketmaurya; I know almost nothing about web stuff and serving. The code you shared does make sense, but yeah, it's still only generating a single token. Something must still be incorrect somewhere.

However, I also wonder if there's maybe a LitServe issue. I tried a simpler example:

import litserve as ls
import json


class SimpleStreamAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x: str(x)

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        for i in range(10):
            yield self.model(i)

    def encode_response(self, output_stream):
        for output in output_stream:
            yield {"output": output}


if __name__ == "__main__":
    server = ls.LitServer(SimpleStreamAPI(), stream=True)
    server.run(port=8000)

and then via

import requests

url = "http://127.0.0.1:8000/predict"

resp = requests.post(url, json={"input": "1, 2, 3"}, headers=None, stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))

it just returns ["{\"output\": \"0\"}"], which doesn't seem right. I would expect it to return 10 things due to the for-loop in predict, or am I misunderstanding this?

EDIT: Nevermind, I pulled the latest LitServe version and it's fixed!

@rasbt (Collaborator, Author) commented May 20, 2024

Ok great, this works flawlessly with the new LitServe version now:

import requests

url = "http://127.0.0.1:8000/predict"

resp = requests.post(url, json={"prompt": "Hello world"}, headers=None, stream=True)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))
{"output": "\n"}{"output": "\n"}{"output": "It"}{"output": " seems"}{"output": " that"}{"output": " I"}{"output": " can"}{"output": "'t"}{"output": " load"}{"output": " my"}{"output": " php"}{"output": " page"}{"output": " from"}{"output": " my"}{"output": " html"}{"output": " file"}{"output": "."}{"output": "\n"}{"output": "\n"}{"output": "A"}{"output": ":"}{"output": "\n"}{"output": "\n"}{"output": "1"}{"output": ")"}{"output": " Check"}{"output": " this"}{"output": " link"}{"output": ","}{"output": " it"}{"output": " might"}{"output": " solve"}{"output": " your"}{"output": " problem"}{"output": "."}{"output": "\n"}{"output": "2"}{"output": ")"}{"output": "Check"}{"output": " this"}{"output": " article"}{"output": "."}{"output": "\n"}{"output": "3"}{"output": ")"}{"output": " check"}{"output": " your"}{"output": " php"}{"output": " file"}{"output": " path"}

@rasbt (Collaborator, Author) commented May 20, 2024

I now also got it to work with the chat version via #1426. I guess we can close this one then and leave the base generate function untouched.

Alternatively, the question is whether we should replace the generate function in chat/base.py with this one here, which can handle both streaming and non-streaming, @carmocca? In the end, it would be less code to maintain (one generate function instead of two).

@carmocca (Contributor) commented:

I didn't merge them before because the logic to check for the stop tokens is not as performant as simply generating a fixed number of tokens. They could probably still be reconciled with some if-elses, but code simplicity over deduplication was preferable before.
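
Purely for illustration, the kind of if-else reconciliation mentioned here might look roughly like the sketch below; sample_next_token and the stop-token handling are hypothetical stand-ins, not the actual LitGPT code:

def generate(model, input_ids, max_new_tokens, stop_tokens=None, stream=False):
    # Schematic sketch: one token-producing loop that serves both the
    # streaming/chat path (with stop-token checks) and the fixed-length path.
    def token_iterator():
        ids = list(input_ids)
        for _ in range(max_new_tokens):
            next_id = sample_next_token(model, ids)  # hypothetical helper
            ids.append(next_id)
            yield next_id
            # This per-step stop-token check is the part that is slower than
            # simply generating a fixed number of tokens.
            if stop_tokens and ids[-len(stop_tokens):] == list(stop_tokens):
                return

    if stream:
        return token_iterator()        # caller consumes tokens lazily
    return list(token_iterator())      # materialize the whole sequence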

@rasbt (Collaborator, Author) commented May 23, 2024

Makes sense and no worries. I'd say let's keep it as is. I got the LitServe streaming example to work after all, so mission accomplished 😊

@rasbt closed this on May 23, 2024
Labels: enhancement (New feature or request)
Projects: None yet
Successfully merging this pull request may close these issues: Stream option
4 participants