server : add "tokens" output #10853
Conversation
ggml-ci
This added `tokens` field can also increase the response JSON size quite a lot if the returned text is long. Do you think we should only return it if the user explicitly requests it? We could add a field like `"return_tokens"` and default it to `false` (see how `"timings_per_token"` works for an example).
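For illustration only, here is a rough sketch of how such a flag could gate the field when the response JSON is built. This is not the PR's actual code; it assumes the nlohmann-style `json` type used by the server examples, with the field names taken from the discussion above.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <string>
#include <vector>

using json = nlohmann::json;

// Hypothetical response builder: the (potentially large) array of raw token
// ids is only added when the client sets "return_tokens": true.
static json build_completion_response(const json & request,
                                      const std::string & content,
                                      const std::vector<int32_t> & tokens) {
    json res = {
        {"content", content},
    };
    if (request.value("return_tokens", false)) {
        res["tokens"] = tokens;
    }
    return res;
}
```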
```diff
     std::string content;
+    llama_tokens tokens;
```
I think we can also replace these 2 fields with `completion_token_output`. Then inside `send_partial_response`, we can `std::move` it to `res`.
P.S.: we cannot `std::move` it, because inside `process_token`, `result` is still being used after `send_partial_response`.
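A minimal self-contained sketch of that constraint (the type and function names here are stand-ins, not the server's actual declarations): because `result` is read again after the partial response is sent, the hand-off has to copy rather than move.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using llama_tokens = std::vector<int32_t>;

// Stand-in for completion_token_output: one object bundling the generated
// text with its raw token ids, instead of two separate fields.
struct completion_output {
    std::string  text;
    llama_tokens tokens;
};

struct partial_result {
    completion_output output;
};

void send_partial_response(completion_output & result, partial_result & res) {
    // A copy is taken here; res.output = std::move(result) would leave
    // `result` in a moved-from state for the caller.
    res.output = result;
}

void process_token(completion_output & result, partial_result & res, std::string & generated) {
    send_partial_response(result, res);

    // ... `result` is still used after the call, e.g. to extend the
    // accumulated generation, so it must remain intact here.
    generated += result.text;
}
```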
I'm not really sure that the `"return_tokens"` logic is necessary. The `tokens` array should be similar in JSON length to the `content` string, though I am not sure, performance-wise, how much slower it is to serialize an array of integers compared to a string. Anyway, I've added the flag and added tests.

Note that with `"stream": true` we always return the `tokens` field in the partial responses (i.e. this is not affected by the `"return_tokens"` flag).
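For contrast with the gated final response, here is a minimal sketch of a partial-chunk builder where the field is unconditional; the shape is illustrative, not copied from the PR.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <string>
#include <vector>

using json = nlohmann::json;

// Hypothetical partial-chunk builder for "stream": true — the newly generated
// piece and its token ids are always included, independent of "return_tokens".
static json build_partial_chunk(const std::string & piece,
                                const std::vector<int32_t> & piece_tokens,
                                bool stop) {
    return json {
        {"content", piece},
        {"tokens",  piece_tokens},
        {"stop",    stop},
    };
}
```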
What I'm thinking is that this should not degrade the performance of JSON serializing/parsing. But I'm just thinking about the bandwidth, because it seems like in most cases we're now using double the bandwidth.

For `stream`, I don't think it's a problem because the time to serialize/send/receive/parse is minor compared to the time it takes to generate a token.

But I think for now we can keep it this way. The non-OAI `/completion` is a playground anyway, so it's fine to expose everything. The OAI-compatible `/v1/completions` that I'm planning to do next will be more prod-ready, so it won't have this data in the response.
Edit: I didn't notice that you implemented `return_tokens`, that's good then, let's keep it 👍
> But I think for now we can keep it this way. The non-OAI /completion is a playground anyway so it's fine to expose everything. The OAI compat /v1/completions that I'm planning to do next will be more prod-ready, thus it won't have these data in the response.

Yes, I agree we can keep `/v1/completions` strictly OAI-compatible (i.e. not even have extra fields like `tokens`) and only have these in the non-OAI endpoints like `/completions`.
> you implemented `return_tokens`, that's good then, let's keep it 👍

This is great to see, thank you. I sometimes use the `/completions` API on a bandwidth-constrained network (WireGuard over a bad WAN connection), so having an option to disable tokens when I don't need them is perfect.
Co-authored-by: Xuan Son Nguyen <[email protected]>
Force-pushed from 4d7771b to 5bf29af
The `llama-server` now also outputs raw token ids in its responses. This is useful for various applications, including TTS.
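As a usage sketch (illustrative values, not captured server output), the new field can be consumed by reading the integer array out of the response body; this assumes the same nlohmann-style JSON on the client side.

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Hypothetical /completions response body; the token id values below are
    // made up for illustration.
    const std::string body = R"({
        "content": "Hello world",
        "tokens":  [15496, 995]
    })";

    const nlohmann::json res = nlohmann::json::parse(body);

    const auto tokens = res.at("tokens").get<std::vector<int32_t>>();
    for (const int32_t id : tokens) {
        std::cout << id << "\n"; // raw token ids, e.g. to feed a TTS pipeline
    }
    return 0;
}
```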