Closed
Description
The goal of this feature is to reduce latency for repeated calls to the chat_completion API by saving the kv_cache, keyed by the prompt tokens.
The basic version of this is simply to save the kv_state once the prompt has been evaluated.
Additionally, we should investigate whether it's possible to save and restore the kv_state after the completion has been generated as well.
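
To make the idea concrete, here is a minimal sketch of a cache keyed by prompt tokens. It is not the library's API: `PromptStateCache`, `save`, and `lookup` are hypothetical names, and `kv_state` is treated as an opaque object standing in for whatever the backend can serialize and restore. The sketch assumes a small LRU policy and a longest-prefix lookup, so a request that extends a cached prompt only needs the remaining tokens evaluated.

```python
from collections import OrderedDict
from typing import Any, Optional, Sequence, Tuple


class PromptStateCache:
    """Hypothetical LRU cache mapping token sequences to an opaque saved KV state."""

    def __init__(self, capacity: int = 8) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[Tuple[int, ...], Any]" = OrderedDict()

    def save(self, tokens: Sequence[int], kv_state: Any) -> None:
        """Store kv_state under the exact token sequence that produced it."""
        key = tuple(tokens)
        self._entries[key] = kv_state
        self._entries.move_to_end(key)
        # Evict least recently used entries once over capacity.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def lookup(self, tokens: Sequence[int]) -> Optional[Tuple[int, Any]]:
        """Return (n_matched_tokens, kv_state) for the longest cached key
        that is a prefix of tokens, or None when nothing matches."""
        query = tuple(tokens)
        best: Optional[Tuple[int, ...]] = None
        for key in self._entries:
            if len(key) <= len(query) and query[: len(key)] == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return None
        self._entries.move_to_end(best)  # mark as recently used
        return len(best), self._entries[best]


cache = PromptStateCache(capacity=2)
cache.save([1, 2, 3], "state-after-prompt")
hit = cache.lookup([1, 2, 3, 4, 5])
# On a hit, only the tokens after the matched prefix still need evaluating.
```

Saving the state after the completion would fit the same shape: call `save` with the prompt tokens plus the generated tokens as the key. Whether a follow-up chat request then hits the cache depends on its tokenized prompt starting with exactly those tokens, which is an assumption that would need checking.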