server : remove self-extend features #9860
Conversation
if (!params.ctx_shift) {
    // this check is redundant (for good)
    // we should never get here, because generation should have already stopped in process_token()
    slot.release();
    send_error(slot, "context shift is disabled", ERROR_TYPE_SERVER);
    continue;
}
@ngxson I think the comment is not entirely correct, because in `process_token()` we check against the training context length (`n_ctx_train`), while the slot's context `slot.n_ctx` could be smaller. What do you think?
Hmm no, I did add a check against `slot.n_ctx`. Is this what you're looking for?
llama.cpp/examples/server/server.cpp
Lines 1184 to 1191 in 11ac980
// if context shift is disabled, we stop when it reaches the context limit
if (slot.n_decoded >= slot.n_ctx) {
    slot.truncated      = true;
    slot.stopped_limit  = true;
    slot.has_next_token = false;

    SLT_DBG(slot, "stopped due to running out of context capacity, n_decoded = %d, n_ctx = %d\n", slot.n_decoded, slot.n_ctx);
}
Ah, I missed that, thanks.

Shouldn't we actually check this:

`if (slot.n_prompt_tokens + slot.n_decoded >= n_ctx) {`

Hmm, or maybe:

`if (slot.n_past + slot.n_decoded >= n_ctx) {`

Anyway, I will figure it out, since I'm currently looking into this logic.
Ah yeah, I misunderstood `n_decoded`. Yeah, maybe we even need `(int) system_tokens.size() + slot.n_prompt_tokens`, because `system_tokens` is already in the KV cache before the first decode.

Thanks for looking into this.
No, sorry, I haven't seen #9811.
Force-pushed from e5f74fe to 8a1f439.
* server : remove self-extend (ggml-ci)
* server : fix context limit check to use slot.n_past (ggml-ci)
Targets #9857, fixes #9859.
Drop support for the self-extend related arguments: