scripts: synthetic prompt mode for server-bench.py #14695
Merged
+124
−69
This PR extends `scripts/server-bench.py` with synthetic prompts, which are simply random lists of tokens with a configurable random length. Each prompt is then assigned a configurable random number of tokens to generate; the server is instructed to generate exactly this number of tokens with `n_predict` and `ignore_eos`. This ensures that server performance can be tested easily and consistently for specific combinations of prompt lengths and generation lengths.

I think there is something wrong with how the server handles `ignore_eos`. I'm setting it in the JSON but the server does not respect it. Also, the way it's being handled seems incorrect: both the `slot_params` and the `sampling_params` have a field `ignore_eos` and they're used inconsistently.

I changed the prompt latency to be based on the latency as observed by the Python script instead of as reported by the server. Long-term we can consider extending the script with support for other APIs/projects. I changed the number of workers back to the number of slots to avoid just measuring the time a worker spends in the queue waiting for a free slot.
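The synthetic prompt scheme described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `make_synthetic_requests`, its parameters, and the vocabulary size are hypothetical, and the payload keys `prompt`, `n_predict`, and `ignore_eos` are assumed to match what the server's completion endpoint accepts.

```python
import random


def make_synthetic_requests(n_prompts, prompt_len_range, n_predict_range, n_vocab, seed=0):
    """Build request payloads with random token prompts (hypothetical helper).

    prompt_len_range and n_predict_range are inclusive (min, max) tuples.
    n_vocab is an assumed vocabulary size; token IDs are drawn from [0, n_vocab).
    """
    rng = random.Random(seed)  # seeded so that benchmark runs are reproducible
    requests = []
    for _ in range(n_prompts):
        prompt_len = rng.randint(*prompt_len_range)
        n_predict = rng.randint(*n_predict_range)
        # A synthetic prompt is just a random list of token IDs.
        tokens = [rng.randrange(n_vocab) for _ in range(prompt_len)]
        requests.append({
            "prompt": tokens,
            "n_predict": n_predict,  # number of tokens to generate
            "ignore_eos": True,      # intended to prevent early stopping at EOS
        })
    return requests


if __name__ == "__main__":
    for req in make_synthetic_requests(3, (16, 32), (8, 16), n_vocab=32000):
        print(len(req["prompt"]), req["n_predict"])
```

Fixing the seed keeps the prompt and generation lengths identical between runs, which makes comparisons easier. Given the `ignore_eos` issue mentioned above, it may be worth checking the number of tokens actually returned rather than assuming it equals `n_predict`.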
I changed the script to pass arguments to the server via environment variables. This way users can pass any arguments to the server without `server-bench.py` becoming bloated.
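As an illustration of the environment-variable approach, a launcher could forward the caller's environment to the server process like this. This is a sketch under assumptions, not the script's actual code: `build_server_env` and `start_server` are hypothetical helpers, and the variable name `LLAMA_ARG_N_PARALLEL` in the usage comment is assumed to be one the server reads.

```python
import os
import subprocess


def build_server_env(extra=None):
    """Return a copy of the current environment plus optional overrides.

    Hypothetical helper: anything the user exported before running the
    benchmark is passed through unchanged, so the script does not need a
    dedicated command-line flag for every server option.
    """
    env = dict(os.environ)
    if extra:
        # Environment values must be strings.
        env.update({key: str(value) for key, value in extra.items()})
    return env


def start_server(binary_path, env):
    """Start the server binary with the given environment (hypothetical helper)."""
    return subprocess.Popen([binary_path], env=env)


# Usage (the variable name is an assumption about what the server reads):
#   env = build_server_env({"LLAMA_ARG_N_PARALLEL": 4})
#   proc = start_server("/path/to/server-binary", env)
```

The design choice here is that the benchmark script stays agnostic about server options: it only copies the environment, so new server arguments need no changes to the script.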