Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Device 1: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
version: 6315 (e8d99dd)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA, BLAS
Hardware
GTX 1080 ti & GTX 1050 ti
Models
https://huggingface.co/unsloth/Llama-3.2-3B-Instruct-GGUF
https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
(Q4_K_M for both)
Problem description & steps to reproduce
When I run either of these models with the --jinja flag, any response format specification is completely ignored: both a JSON schema and generic JSON output are ignored. When I remove the flag, it works perfectly. Other models such as gemma3 and qwen3 work fine even with the flag.
I suspect this might also affect other older models that haven't been tested yet.
How to reproduce:
1.) Start llama-server with one of these models and the --jinja flag.
2.) Send a request with a response format specification (either JSON mode or a JSON schema; see the example request below).
3.) The result ignores the response format.
4.) Retry without the --jinja flag and the response format is respected as expected.
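For reference, this is roughly the request I use in step 2.) — a minimal Python sketch. The port and API key match the server invocation in the log output below, the response_format fields follow the OpenAI-compatible /v1/chat/completions convention, and the schema itself is just an illustrative example:

import json
import requests

# Port taken from the llama-server log below; "none" matches --api-key none
url = "http://localhost:5807/v1/chat/completions"
headers = {"Authorization": "Bearer none"}

payload = {
    "messages": [{"role": "user", "content": "Name the capital of France."}],
    # Either variant shows the problem when the server runs with --jinja:
    # "response_format": {"type": "json_object"},  # generic JSON mode
    "response_format": {  # OpenAI-style JSON schema (illustrative schema only)
        "type": "json_schema",
        "json_schema": {
            "name": "capital",
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
}

resp = requests.post(url, headers=headers, json=payload, timeout=60)
print(json.dumps(resp.json(), indent=2))

With --jinja the reply comes back as free-form prose; without the flag the same request returns valid JSON matching the schema.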
First Bad Commit
No response
Relevant log output
llama-server
--n-predict 32000
--jinja
--api-key none
--timeout 3600
--metrics
--model models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
--main-gpu 0
--n-gpu-layers 999
--split-mode none
--ctx-size 74000
-fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Device 1: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
build: 6315 (e8d99dd0) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
system info: n_threads = 4, n_threads_batch = 4, total_threads = 4
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 5807, http threads: 3
main: loading model
srv load_model: loading model '/home/emccartney/.cache/llama.cpp/unsloth_Llama-3.2-3B-Instruct-GGUF_Llama-3.2-3B-Instruct-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1080 Ti) - 10887 MiB free
srv log_server_r: request: GET /health 127.0.0.1 503
llama_model_loader: loaded meta data with 36 key-value pairs and 255 tensors from /home/emccartney/.cache/llama.cpp/unsloth_Llama-3.2-3B-Instruct-GGUF_Llama-3.2-3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.2-3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2-3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 3B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 15
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.2-3B-Instruct-GGUF/imatrix_un...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.2-3B-Inst...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.87 GiB (5.01 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3072
print_info: n_layer = 28
print_info: n_head = 24
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 3B
print_info: model params = 3.21 B
print_info: general.name = Llama-3.2-3B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 308.23 MiB
load_tensors: CUDA0 model buffer size = 1918.35 MiB
...........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 74000
llama_context: n_ctx_per_seq = 74000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = false
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (74000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: CUDA0 KV buffer size = 8120.00 MiB
llama_kv_cache: size = 8120.00 MiB ( 74240 cells, 28 layers, 1/1 seqs), K (f16): 4060.00 MiB, V (f16): 4060.00 MiB
llama_context: CUDA0 compute buffer size = 256.50 MiB
llama_context: CUDA_Host compute buffer size = 151.01 MiB
llama_context: graph nodes = 875
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 74240
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 74240
main: model loaded
main: chat template, chat_template: {{- bos_token }}
{%- if custom_tools is defined %}
{%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
{%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
{%- if strftime_now is defined %}
{%- set date_string = strftime_now("%d %b %Y") %}
{%- else %}
{%- set date_string = "26 Jul 2024" %}
{%- endif %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = none %}
{%- endif %}
{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
{%- set system_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{%- set system_message = "" %}
{%- endif %}
{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
{{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
{{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}
{#- Custom tools are passed in a user message with some extra guidance #}
{%- if tools_in_user_message and not tools is none %}
{#- Extract the first user message so we can plug it in here #}
{%- if messages | length != 0 %}
{%- set first_user_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
{%- endif %}
{{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
{{- "Given the following functions, please respond with a JSON for a function call " }}
{{- "with its proper arguments that best answers the given prompt.\n\n" }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{{- first_user_message + "<|eot_id|>"}}
{%- endif %}
{%- for message in messages %}
{%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
{%- elif 'tool_calls' in message %}
{%- if not message.tool_calls|length == 1 %}
{{- raise_exception("This model only supports single tool-calls at once!") }}
{%- endif %}
{%- set tool_call = message.tool_calls[0].function %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
{{- '{"name": "' + tool_call.name + '", ' }}
{{- '"parameters": ' }}
{{- tool_call.arguments | tojson }}
{{- "}" }}
{{- "<|eot_id|>" }}
{%- elif message.role == "tool" or message.role == "ipython" %}
{{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
{%- if message.content is mapping or message.content is iterable %}
{{- message.content | tojson }}
{%- else %}
{{- message.content }}
{%- endif %}
{{- "<|eot_id|>" }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
, example_format: '<|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 29 Aug 2025
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'
main: server is listening on http://0.0.0.0:5807 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 1021 | processing task
slot update_slots: id 0 | task 1021 | new prompt, n_ctx_slot = 74240, n_keep = 0, n_prompt_tokens = 2987
slot update_slots: id 0 | task 1021 | kv cache rm [2971, end)
slot update_slots: id 0 | task 1021 | prompt processing progress, n_past = 2987, n_tokens = 16, progress = 0.005357
slot update_slots: id 0 | task 1021 | prompt done, n_past = 2987, n_tokens = 16
slot release: id 0 | task 1021 | stop processing: n_past = 3301, truncated = 0
slot print_timing: id 0 | task 1021 |
prompt eval time = 24.57 ms / 16 tokens ( 1.54 ms per token, 651.12 tokens per second)
eval time = 3684.73 ms / 315 tokens ( 11.70 ms per token, 85.49 tokens per second)
total time = 3709.31 ms / 331 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
llama-server
--n-predict 32000
--api-key none
--timeout 3600
--metrics
--model models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
--main-gpu 0
--n-gpu-layers 999
--split-mode none
--ctx-size 74000
-fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
Device 1: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
build: 6315 (e8d99dd0) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
system info: n_threads = 4, n_threads_batch = 4, total_threads = 4
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CUDA : ARCHS = 610 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 5807, http threads: 3
main: loading model
srv load_model: loading model '/home/emccartney/.cache/llama.cpp/unsloth_Llama-3.2-3B-Instruct-GGUF_Llama-3.2-3B-Instruct-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1080 Ti) - 10887 MiB free
srv log_server_r: request: GET /health 127.0.0.1 503
llama_model_loader: loaded meta data with 36 key-value pairs and 255 tensors from /home/emccartney/.cache/llama.cpp/unsloth_Llama-3.2-3B-Instruct-GGUF_Llama-3.2-3B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama-3.2-3B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Llama-3.2-3B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 3B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: llama.block_count u32 = 28
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 3072
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 8192
llama_model_loader: - kv 12: llama.attention.head_count u32 = 24
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 15
llama_model_loader: - kv 32: quantize.imatrix.file str = Llama-3.2-3B-Instruct-GGUF/imatrix_un...
llama_model_loader: - kv 33: quantize.imatrix.dataset str = unsloth_calibration_Llama-3.2-3B-Inst...
llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 689
llama_model_loader: - type f32: 58 tensors
llama_model_loader: - type q4_K: 168 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 1.87 GiB (5.01 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3072
print_info: n_layer = 28
print_info: n_head = 24
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 3B
print_info: model params = 3.21 B
print_info: general.name = Llama-3.2-3B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128004 '<|finetune_right_pad_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 308.23 MiB
load_tensors: CUDA0 model buffer size = 1918.35 MiB
...........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 74000
llama_context: n_ctx_per_seq = 74000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: kv_unified = false
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (74000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: CUDA0 KV buffer size = 8120.00 MiB
llama_kv_cache: size = 8120.00 MiB ( 74240 cells, 28 layers, 1/1 seqs), K (f16): 4060.00 MiB, V (f16): 4060.00 MiB
llama_context: CUDA0 compute buffer size = 256.50 MiB
llama_context: CUDA_Host compute buffer size = 151.01 MiB
llama_context: graph nodes = 875
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 74240
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 74240
main: model loaded
main: chat template, chat_template: {{- bos_token }}
{%- if custom_tools is defined %}
{%- set tools = custom_tools %}
{%- endif %}
{%- if not tools_in_user_message is defined %}
{%- set tools_in_user_message = true %}
{%- endif %}
{%- if not date_string is defined %}
{%- if strftime_now is defined %}
{%- set date_string = strftime_now("%d %b %Y") %}
{%- else %}
{%- set date_string = "26 Jul 2024" %}
{%- endif %}
{%- endif %}
{%- if not tools is defined %}
{%- set tools = none %}
{%- endif %}
{#- This block extracts the system message, so we can slot it into the right place. #}
{%- if messages[0]['role'] == 'system' %}
{%- set system_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{%- set system_message = "" %}
{%- endif %}
{#- System message #}
{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
{%- if tools is not none %}
{{- "Environment: ipython\n" }}
{%- endif %}
{{- "Cutting Knowledge Date: December 2023\n" }}
{{- "Today Date: " + date_string + "\n\n" }}
{%- if tools is not none and not tools_in_user_message %}
{{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{%- endif %}
{{- system_message }}
{{- "<|eot_id|>" }}
{#- Custom tools are passed in a user message with some extra guidance #}
{%- if tools_in_user_message and not tools is none %}
{#- Extract the first user message so we can plug it in here #}
{%- if messages | length != 0 %}
{%- set first_user_message = messages[0]['content']|trim %}
{%- set messages = messages[1:] %}
{%- else %}
{{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
{%- endif %}
{{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
{{- "Given the following functions, please respond with a JSON for a function call " }}
{{- "with its proper arguments that best answers the given prompt.\n\n" }}
{{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
{{- "Do not use variables.\n\n" }}
{%- for t in tools %}
{{- t | tojson(indent=4) }}
{{- "\n\n" }}
{%- endfor %}
{{- first_user_message + "<|eot_id|>"}}
{%- endif %}
{%- for message in messages %}
{%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
{%- elif 'tool_calls' in message %}
{%- if not message.tool_calls|length == 1 %}
{{- raise_exception("This model only supports single tool-calls at once!") }}
{%- endif %}
{%- set tool_call = message.tool_calls[0].function %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
{{- '{"name": "' + tool_call.name + '", ' }}
{{- '"parameters": ' }}
{{- tool_call.arguments | tojson }}
{{- "}" }}
{{- "<|eot_id|>" }}
{%- elif message.role == "tool" or message.role == "ipython" %}
{{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
{%- if message.content is mapping or message.content is iterable %}
{{- message.content | tojson }}
{%- else %}
{{- message.content }}
{%- endif %}
{{- "<|eot_id|>" }}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
, example_format: '<|start_header_id|>system<|end_header_id|>
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'
main: server is listening on http://0.0.0.0:5807 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 1084 | processing task
slot update_slots: id 0 | task 1084 | new prompt, n_ctx_slot = 74240, n_keep = 0, n_prompt_tokens = 3025
slot update_slots: id 0 | task 1084 | kv cache rm [3009, end)
slot update_slots: id 0 | task 1084 | prompt processing progress, n_past = 3025, n_tokens = 16, progress = 0.005289
slot update_slots: id 0 | task 1084 | prompt done, n_past = 3025, n_tokens = 16
slot release: id 0 | task 1084 | stop processing: n_past = 3243, truncated = 0
slot print_timing: id 0 | task 1084 |
prompt eval time = 44.03 ms / 16 tokens ( 2.75 ms per token, 363.42 tokens per second)
eval time = 2607.50 ms / 219 tokens ( 11.91 ms per token, 83.99 tokens per second)
total time = 2651.52 ms / 235 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200