server : add "tokens" output #10853
Conversation
ggml-ci
This added `tokens` field can also increase the response JSON size quite a lot if the returned text is long. Do you think we should only return it if the user explicitly requests it? We could add a field like `"return_tokens"` and default it to `false` (see how `"timings_per_token"` works for an example).
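For illustration only, here is a rough sketch of how such a flag could gate the field when the response JSON is built. This is not the PR's actual code; it assumes the nlohmann-style `json` type used by the server examples, with the field names taken from the discussion above.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <string>
#include <vector>

using json = nlohmann::json;

// Hypothetical response builder: the (potentially large) array of raw token
// ids is only added when the client sets "return_tokens": true.
static json build_completion_response(const json & request,
                                      const std::string & content,
                                      const std::vector<int32_t> & tokens) {
    json res = {
        {"content", content},
    };
    if (request.value("return_tokens", false)) {
        res["tokens"] = tokens;
    }
    return res;
}
```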
```diff
     std::string content;
+    llama_tokens tokens;
```
I think we can also replace these 2 fields with `completion_token_output`. Then inside `send_partial_response`, we can `std::move` it to `res`.
P.S.: we cannot `std::move` it, because inside `process_token`, `result` is still being used after `send_partial_response`.
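A minimal self-contained sketch of that constraint (the type and function names here are stand-ins, not the server's actual declarations): because `result` is read again after the partial response is sent, the hand-off has to copy rather than move.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using llama_tokens = std::vector<int32_t>;

// Stand-in for completion_token_output: one object bundling the generated
// text with its raw token ids, instead of two separate fields.
struct completion_output {
    std::string  text;
    llama_tokens tokens;
};

struct partial_result {
    completion_output output;
};

void send_partial_response(completion_output & result, partial_result & res) {
    // A copy is taken here; res.output = std::move(result) would leave
    // `result` in a moved-from state for the caller.
    res.output = result;
}

void process_token(completion_output & result, partial_result & res, std::string & generated) {
    send_partial_response(result, res);

    // ... `result` is still used after the call, e.g. to extend the
    // accumulated generation, so it must remain intact here.
    generated += result.text;
}
```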
I'm not really sure that the `"return_tokens"` logic is necessary. The `tokens` array should be similar in JSON length to the `content` string, though I am not sure, performance-wise, how much slower it is to serialize an array of integers compared to a string. Anyway, I've added the flag and added tests.

Note that with `"stream": true` we always return the `tokens` field in the partial responses (i.e. this is not affected by the `"return_tokens"` flag).
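For contrast with the gated final response, here is a minimal sketch of a partial-chunk builder where the field is unconditional; the shape is illustrative, not copied from the PR.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <string>
#include <vector>

using json = nlohmann::json;

// Hypothetical partial-chunk builder for "stream": true — the newly generated
// piece and its token ids are always included, independent of "return_tokens".
static json build_partial_chunk(const std::string & piece,
                                const std::vector<int32_t> & piece_tokens,
                                bool stop) {
    return json {
        {"content", piece},
        {"tokens",  piece_tokens},
        {"stop",    stop},
    };
}
```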
What I'm thinking is that this should not degrade the performance of JSON serializing/parsing. But I'm just thinking about the bandwidth, because it seems like in most cases we're now using double the bandwidth.

For `stream`, I don't think it's a problem because the time to serialize/send/receive/parse is minor compared to the time it takes to generate a token.

But I think for now we can keep it this way. The non-OAI `/completion` is a playground anyway, so it's fine to expose everything. The OAI-compatible `/v1/completions` that I'm planning to do next will be more prod-ready, so it won't have this data in the response.
Edit: I didn't notice that you implemented `return_tokens`, that's good then, let's keep it 👍
> But I think for now we can keep it this way. The non-OAI /completion is a playground anyway so it's fine to expose everything. The OAI compat /v1/completions that I'm planning to do next will be more prod-ready, thus it won't have these data in the response.

Yes, I agree we can keep `/v1/completions` strictly OAI-compatible (i.e. not even have extra fields like `tokens`) and only have these in the non-OAI endpoints like `/completions`.
> you implemented `return_tokens`, that's good then, let's keep it 👍

This is great to see, thank you. I sometimes use the `/completions` API on a bandwidth-constrained network (WireGuard over a bad WAN connection), so having an option to disable tokens when I don't need them is perfect.
Co-authored-by: Xuan Son Nguyen <[email protected]>
Force-pushed from 4d7771b to 5bf29af
The `llama-server` now also outputs raw token ids in its responses. This is useful for various applications, including TTS.
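As a usage sketch (illustrative values, not captured server output), the new field can be consumed by reading the integer array out of the response body; this assumes the same nlohmann-style JSON on the client side.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Hypothetical /completions response body; the token id values below are
    // made up for illustration.
    const std::string body = R"({
        "content": "Hello world",
        "tokens":  [15496, 995]
    })";

    const nlohmann::json res = nlohmann::json::parse(body);

    const auto tokens = res.at("tokens").get<std::vector<int32_t>>();
    for (const int32_t id : tokens) {
        std::cout << id << "\n"; // raw token ids, e.g. to feed a TTS pipeline
    }
    return 0;
}
```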