Dude, c'mon, don't just post slop. The instructions say:
# Start server with speculative decoding
llama-server \
-m /path/to/target-model.gguf \
--model-draft /path/to/draft-model.gguf \
-ngl 99 -c 4096 --port 8080
# In another terminal, run benchmark
python bench.py --url http://127.0.0.1:8080 --requests 5 --max-tokens 512
So I run a server... and... lol:
#$ python3 bench.py --url http://127.0.0.1:8123 --requests 5 --max-tokens 512
usage: bench.py [-h] --base-url BASE_URL --model MODEL [--api-key API_KEY] [--compare-url COMPARE_URL] [--compare-model COMPARE_MODEL] [--compare-api-key COMPARE_API_KEY] [--compare-label COMPARE_LABEL]
[--label LABEL] [--runs RUNS] [--max-tokens MAX_TOKENS] [--temperature TEMPERATURE] [--prompt PROMPT]
bench.py: error: the following arguments are required: --base-url, --model
#$ python3 bench.py --base-url http://127.0.0.1:8123 --requests 5 --max-tokens 512
usage: bench.py [-h] --base-url BASE_URL --model MODEL [--api-key API_KEY] [--compare-url COMPARE_URL] [--compare-model COMPARE_MODEL] [--compare-api-key COMPARE_API_KEY] [--compare-label COMPARE_LABEL]
[--label LABEL] [--runs RUNS] [--max-tokens MAX_TOKENS] [--temperature TEMPERATURE] [--prompt PROMPT]
bench.py: error: the following arguments are required: --model
#$ python3 bench.py --base-url http://127.0.0.1:8123 --requests 5 --max-tokens 512 --model /mnt/models/qwen/bartowski/Qwen_Qwen2.5-VL-72B-Instruct-Q8_0-00001-of-00002.gguf
usage: bench.py [-h] --base-url BASE_URL --model MODEL [--api-key API_KEY] [--compare-url COMPARE_URL] [--compare-model COMPARE_MODEL] [--compare-api-key COMPARE_API_KEY] [--compare-label COMPARE_LABEL]
[--label LABEL] [--runs RUNS] [--max-tokens MAX_TOKENS] [--temperature TEMPERATURE] [--prompt PROMPT]
bench.py: error: unrecognized arguments: --requests 5
#$ python3 bench.py --base-url http://127.0.0.1:8123 --max-tokens 512 --model /mnt/models/qwen/bartowski/Qwen_Qwen2.5-VL-72B-Instruct-Q8_0-00001-of-00002.gguf
======================================================================
draftbench - OpenAI-compatible endpoint benchmark
======================================================================
Endpoint : http://127.0.0.1:8123
Model : /mnt/models/qwen/bartowski/Qwen_Qwen2.5-VL-72B-Instruct-Q8_0-00001-of-00002.gguf
Prompts : 3
Runs : 1
MaxTok : 512
Temp : 0.0
======================================================================
[baseline] request 1/3 ERROR: HTTPConnectionPool(host='127.0.0.1', port=8123): Read timed out. (read timeout=120)
[baseline] request 2/3 ...
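For what it's worth, the mismatch is easy to see from the usage lines: the README says `--url` and `--requests`, but the parser only knows `--base-url` and `--runs` (and `--model` is required but never mentioned). A minimal sketch of what bench.py's argparse setup appears to be, reconstructed purely from the usage output above (the real script may differ):

```python
import argparse

# Hypothetical reconstruction of bench.py's parser, based only on the
# usage output printed above -- not the actual source.
def build_parser():
    p = argparse.ArgumentParser(prog="bench.py")
    p.add_argument("--base-url", required=True)    # README says --url
    p.add_argument("--model", required=True)       # README omits this entirely
    p.add_argument("--api-key")
    p.add_argument("--label")
    p.add_argument("--runs", type=int, default=1)  # README says --requests
    p.add_argument("--max-tokens", type=int, default=512)
    p.add_argument("--temperature", type=float, default=0.0)
    p.add_argument("--prompt")
    return p

# Feeding it the README's flags: parse_known_args collects the
# unrecognized ones instead of exiting, so you can see what's rejected.
args, unknown = build_parser().parse_known_args(
    ["--base-url", "http://127.0.0.1:8123", "--model", "m.gguf",
     "--requests", "5"]
)
print(unknown)  # ["--requests", "5"] -- the flag the README tells you to use
```

So either the README needs updating to `--base-url`/`--runs`/`--model`, or the script's flags should match the docs.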
Also, I need to be able to pass offload splits (e.g. -ngl 9 -ts 8,1), so I'm not sure I can even try the server runner (unless you wanna share some hardware with a bro; I'm stuck splitting a 12GB 3060 and a 6GB 2060 on a 768GB-RAM old Xeon server, or using an M1 Studio 64GB).
Was really hoping I could get a bit more out of my sh!tty server with this tool :)