Insights: vllm-project/vllm
Overview
192 Pull requests merged by 80 people
-
Fix TensorSchema validation test for symbolic dims
#22366 merged
Aug 10, 2025 -
Remove redundant row_indices unsqueeze operation in MiniCPMO
#22528 merged
Aug 10, 2025 -
Migrate LlavaNextImageInputs to TensorSchema
#21774 merged
Aug 10, 2025 -
Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion Benchmarks
#22534 merged
Aug 10, 2025 -
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel
#22593 merged
Aug 10, 2025 -
[doc] add alibaba cloud as sponsor
#22597 merged
Aug 10, 2025 -
[doc] add beijing meetup links
#22596 merged
Aug 10, 2025 -
Move CacheConfig from config/__init__.py to config/cache.py
#22586 merged
Aug 10, 2025 -
[Misc] Replace flaky image urls in pixtral test
#22574 merged
Aug 10, 2025 -
[Docs] Fix warnings in docs build
#22588 merged
Aug 10, 2025 -
[Misc] Further refine type annotations in parallel state
#22499 merged
Aug 10, 2025 -
[Doc] Fix API doc link in side navigation
#22585 merged
Aug 10, 2025 -
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_config
#22566 merged
Aug 10, 2025 -
[Minor] Fix pre-commit error on main
#22579 merged
Aug 10, 2025 -
Refactor sliding window configuration to Transformers best practice
#21927 merged
Aug 10, 2025 -
[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block
#22394 merged
Aug 10, 2025 -
Improve fast_topk function with type hints and documentation
#22530 merged
Aug 10, 2025 -
[Config] add "qwen" as a native eagle3 target supported model
#22333 merged
Aug 10, 2025 -
[oss] Init gpt-oss bf16 support
#22508 merged
Aug 10, 2025 -
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers
#21401 merged
Aug 10, 2025 -
[FEAT] [Performance] Add triton mrope to replace the torch code path
#22375 merged
Aug 9, 2025 -
[Bugfix] Fix basic models tests hanging due to mm processor creation
#22571 merged
Aug 9, 2025 -
[Model] Gemma3n MM
#20495 merged
Aug 9, 2025 -
Move ParallelConfig from config/__init__.py to config/parallel.py
#22565 merged
Aug 9, 2025 -
[Docs] Reduce noise in docs and --help from the JSON tip
#22567 merged
Aug 9, 2025 -
[CI] [Hybrid] Speed up hybrid models test by removing large models
#22563 merged
Aug 9, 2025 -
GLM-4.5V with new class name at transformers
#22520 merged
Aug 9, 2025 -
Update docs for Minimax-Text support
#22562 merged
Aug 9, 2025 -
[Bugfix] Fix CI moe kernel failure
#22556 merged
Aug 9, 2025 -
[Bugfix] Fix failing GPT-OSS initialization test
#22557 merged
Aug 9, 2025 -
[ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel
#22097 merged
Aug 9, 2025 -
[TPU] Add support for online w8a8 quantization
#22425 merged
Aug 9, 2025 -
Fix loading of quantized BigCode models
#22463 merged
Aug 9, 2025 -
[Misc] Use config definitions from Transformers library
#21913 merged
Aug 9, 2025 -
v1: Pass KVConnectorOutput to scheduler-side
#22157 merged
Aug 9, 2025 -
[V1] [Hybrid] Support Minimax-Text-01 in V1
#22151 merged
Aug 9, 2025 -
[Log] Add Warning for Deprecation of DeepGEMM old version
#22194 merged
Aug 9, 2025 -
Remove mamba_ssm from vLLM requirements; install inside test container using --no-build-isolation
#22541 merged
Aug 9, 2025 -
[Doc] Add usage of implicit text-only mode
#22561 merged
Aug 9, 2025 -
Implicit language-model-only mode via limit-mm-per-prompt
#22299 merged
Aug 9, 2025 -
[Bugfix] Fix ModernBert cuda graph capturing in v1
#21901 merged
Aug 9, 2025 -
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D
#22317 merged
Aug 9, 2025 -
[XPU] upgrade torch 2.8 on for XPU
#22300 merged
Aug 9, 2025 -
Drop flaky test_healthcheck_response_time
#22539 merged
Aug 8, 2025 -
Extract CompilationConfig from config.py
#22524 merged
Aug 8, 2025 -
[Frontend] Add unix domain socket support
#18097 merged
Aug 8, 2025 -
[Docs] fix broken links in metrics.md
#22315 merged
Aug 8, 2025 -
Skip Qwen 1 in CI because remote code is no longer compatible with Transformers
#22536 merged
Aug 8, 2025 -
[Bugfix] Update FA commit hash
#22546 merged
Aug 8, 2025 -
[Misc] DeepGEMM : Avoid JIT generation in the hot-path
#22215 merged
Aug 8, 2025 -
[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA
#21691 merged
Aug 8, 2025 -
[gpt-oss] Support tool call and implement MCP tool server
#22427 merged
Aug 8, 2025 -
[Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling”
#22466 merged
Aug 8, 2025 -
[gpt-oss] guard import when triton kernel is not installed
#22529 merged
Aug 8, 2025 -
[Benchmark] Add benchmark tool for multi turn conversations
#20267 merged
Aug 8, 2025 -
[gpt-oss] triton kernel mxfp4
#22421 merged
Aug 8, 2025 -
Remove exception for Python 3.8 typing from linter
#22506 merged
Aug 8, 2025 -
[Docs] Improve API docs (+small tweaks)
#22459 merged
Aug 8, 2025 -
[BugFix] Don't cancel asyncio tasks directly from destructors
#22476 merged
Aug 8, 2025 -
[Misc] fix openai version
#22485 merged
Aug 8, 2025 -
[Misc] Begin deprecation of get_tensor_model_*_group
#22494 merged
Aug 8, 2025 -
[CI/Build] Fix multimodal tests
#22491 merged
Aug 8, 2025 -
[bench] Fix benchmark/serve.py to ignore unavailable results
#22382 merged
Aug 8, 2025 -
[Doc] Sleep mode documentation
#22310 merged
Aug 8, 2025 -
[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10
#22426 merged
Aug 8, 2025 -
Fix pre-commit
#22487 merged
Aug 8, 2025 -
Optimize MiniCPMO mask creation with vectorized implementation
#22464 merged
Aug 8, 2025 -
not tie_word_embeddings for glm-4.5 and glm-4.5v
#22460 merged
Aug 8, 2025 -
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match
#22065 merged
Aug 8, 2025 -
[Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000)
#22131 merged
Aug 8, 2025 -
Fix Flashinfer CUTLASS MOE Allgather
#21963 merged
Aug 8, 2025 -
Support Tensorrt-LLM MoE fp4 for low-latency
#21331 merged
Aug 8, 2025 -
Add ModelOpt Qwen3 nvfp4 support
#20101 merged
Aug 8, 2025 -
[PERF] Use pybase64 to more quickly decode prompt embeddings
#22469 merged
Aug 8, 2025 -
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine
#21496 merged
Aug 8, 2025 -
[Misc] normalize multiprocessing Queue usage
#22371 merged
Aug 8, 2025 -
Remove from_dict from SpeculativeConfig
#22451 merged
Aug 7, 2025 -
[Frontend] Use engine argument to control MM cache size
#22441 merged
Aug 7, 2025 -
[Core] Simplify mm processing cache
#22457 merged
Aug 7, 2025 -
Fix pre-commit error in main
#22462 merged
Aug 7, 2025 -
[gpt-oss] Generate ResponseOutputItem from Harmony Message
#22410 merged
Aug 7, 2025 -
[Tool] Fix auto tool call
#22434 merged
Aug 7, 2025 -
[Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM
#22352 merged
Aug 7, 2025 -
[Core] Store only the keys for multi-modal data in P0
#22198 merged
Aug 7, 2025 -
[Docs] Update features/disagg_prefill, add v1 examples and development
#22165 merged
Aug 7, 2025 -
[Doc] update docs for nightly benchmarks
#12022 merged
Aug 7, 2025 -
[Docs] Factor out troubleshooting to its own guide; add section for Ray Observability
#21578 merged
Aug 7, 2025 -
[Doc] Fix link to prefix caching design
#22384 merged
Aug 7, 2025 -
[Misc] Enhance code formatting in mxfp4.py
#22423 merged
Aug 7, 2025 -
Add H20-3e fused MoE kernel tuning configs for GLM-4.5
#22433 merged
Aug 7, 2025 -
[Docs] Add missing dependency for docs build
#22435 merged
Aug 7, 2025 -
feat: Add --enable-log-outputs flag for logging model generations
#20707 merged
Aug 7, 2025 -
[Misc] Support routing logic simulation
#21990 merged
Aug 7, 2025 -
[Frontend] Update OpenAI error response to upstream format
#22099 merged
Aug 7, 2025 -
[Model] Switch to Fused RMS norm in Qwen2.5_VL model.
#22184 merged
Aug 7, 2025 -
[Bench] Split serve.py:main into async/async versions
#22405 merged
Aug 7, 2025 -
[CI] Skip the pooling models that do not support transformers v4.55
#22411 merged
Aug 7, 2025 -
[Bugfix] EPLB load statistics problem
#22167 merged
Aug 7, 2025 -
[gpt-oss] Convert user input to harmony format
#22402 merged
Aug 7, 2025 -
preload heavy modules when mp method is forkserver
#22214 merged
Aug 7, 2025 -
Optimize logger init performance by using module-level constants
#22373 merged
Aug 7, 2025 -
Update hf_xet pin to resolve hangs
#22356 merged
Aug 7, 2025 -
[Bugfix] Add proper comparison for package versions
#22314 merged
Aug 7, 2025 -
[Bugfix]: Fix the streaming output for function calls in the minimax
#22015 merged
Aug 7, 2025 -
Use float32 for test_completion.py
#22385 merged
Aug 7, 2025 -
[Bugfix] Fix wrong method name in Intern-S1 image processor
#22417 merged
Aug 7, 2025 -
[Qwen3] Enable dual-chunk-attention support for Qwen3 models.
#21924 merged
Aug 7, 2025 -
[XPU] Fix flash_attn_varlen_func interface on xpu
#22350 merged
Aug 7, 2025 -
Support encoder_only attention for FlexAttention
#22273 merged
Aug 7, 2025 -
[model] Support MiniCPM-V 4.0
#22166 merged
Aug 7, 2025 -
Update flashinfer-python==0.2.10
#22389 merged
Aug 7, 2025 -
Fix trtllm-gen attention env and add attention sink
#22378 merged
Aug 7, 2025 -
[gpt-oss] fix model config with hf_config
#22401 merged
Aug 7, 2025 -
[gpt-oss] add demo tool server
#22393 merged
Aug 7, 2025 -
[Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue
#22399 merged
Aug 7, 2025 -
[v1] - Mamba1 Attention Metadata
#21249 merged
Aug 7, 2025 -
[gpt-oss] flashinfer mxfp4
#22339 merged
Aug 6, 2025 -
[gpt-oss] attention sink init fix gemini
#22335 merged
Aug 6, 2025 -
[gpt-oss] Add loop for built-in tool call
#22374 merged
Aug 6, 2025 -
[Bugfix] Make condition in triton kernel constexpr
#22370 merged
Aug 6, 2025 -
[BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks
#22368 merged
Aug 6, 2025 -
add the codes to check AMD Instinct GPU number
#22367 merged
Aug 6, 2025 -
[BugFix] Fix FA2 RuntimeError when sinks is provided
#22365 merged
Aug 6, 2025 -
[Minor] Fix type
#22347 merged
Aug 6, 2025 -
[gpt-oss] Support chat completion api
#22342 merged
Aug 6, 2025 -
[gpt-oss] add model to supported models doc
#22336 merged
Aug 6, 2025 -
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils
#22340 merged
Aug 6, 2025 -
[Misc] Clean up duplicated hf overrides
#22311 merged
Aug 6, 2025 -
[gpt-oss] Add openai-harmony as default dependency
#22332 merged
Aug 6, 2025 -
[gpt-oss] flashinfer attention sink init
#22330 merged
Aug 6, 2025 -
[GptOss] Add GptOss reasoning parser to support structure output
#22322 merged
Aug 6, 2025 -
[ROCm] Add attention sink to use_rocm_custom_paged_attention
#22329 merged
Aug 6, 2025 -
Add GPT-OSS model code and config [1/N]
#22327 merged
Aug 6, 2025 -
Update transformers to v4.55
#21931 merged
Aug 6, 2025 -
Add attention sink in attention backends
#22320 merged
Aug 6, 2025 -
Increase openai-python version
#22316 merged
Aug 6, 2025 -
Upgrade FA3 for attention sink
#22313 merged
Aug 6, 2025 -
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm
#22264 merged
Aug 6, 2025 -
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation
#22275 merged
Aug 6, 2025 -
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding
#21862 merged
Aug 6, 2025 -
[Bugfix] Fix MoE BNB version
#22260 merged
Aug 6, 2025 -
[Bugfix] Fix 3D input passed into cutlass_scaled_mm
#22278 merged
Aug 6, 2025 -
[Bugfix] Remove faulty test for oot attention backend
#22286 merged
Aug 6, 2025 -
[CI][TPU] Fix docker clean up
#22271 merged
Aug 5, 2025 -
[bugfix] fix blackwell deepep installation
#22255 merged
Aug 5, 2025 -
[V1] port xformers backend to v1
#21342 merged
Aug 5, 2025 -
[Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING
#22199 merged
Aug 5, 2025 -
[CI/Build] Update flashinfer to 0.2.9
#22233 merged
Aug 5, 2025 -
Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail
#22128 merged
Aug 5, 2025 -
[V0 Deprecation][TPU] Remove V1 flag check from tests
#22248 merged
Aug 5, 2025 -
[Misc] correct static type check for GroupCoordinator
#21946 merged
Aug 5, 2025 -
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel
#22095 merged
Aug 5, 2025 -
[Feature] Non-contiguous Support for FP8 Quantization
#21961 merged
Aug 5, 2025 -
Migrate KimiVLImagePixelInputs to TensorSchema
#21769 merged
Aug 5, 2025 -
[Docs][TPU] Highlight TPU Software version selection
#22242 merged
Aug 5, 2025 -
[Model] Pooling model activation supports per request control by PoolingParams
#20538 merged
Aug 5, 2025 -
[Core] Factor out common logic for MM budget calculation
#22228 merged
Aug 5, 2025 -
[UX] Fail if an invalid attention backend is specified
#22217 merged
Aug 5, 2025 -
[Bugfix] Misaligned params in TreeAttentionImpl
#22226 merged
Aug 5, 2025 -
Optimize configuration access with LRU cache in custom ops
#22204 merged
Aug 5, 2025 -
[Misc] log more detailed message for ensure_model_parallel_initialized
#22144 merged
Aug 5, 2025 -
[Doc] add backend to doc string of initialize_model_parallel
#22142 merged
Aug 5, 2025 -
[Misc] Remove pass_config from CompilationConfig dump_json excluded
#21911 merged
Aug 5, 2025 -
fix: kimi_k2 return empty tool call list
#22149 merged
Aug 5, 2025 -
[Log] DeepGEMM Update Log for Unaligned Problem Size
#22208 merged
Aug 5, 2025 -
self.gate dtype update for GLM-4.5
#22203 merged
Aug 5, 2025 -
[ROCm][Bugfix] Compilation passes fix
#22202 merged
Aug 5, 2025 -
[FEAT] Refactor ROPE into module
#22192 merged
Aug 5, 2025 -
[V0 deprecation][P/D] Deprecate v0 KVConnectorBase code (1/2)
#21785 merged
Aug 5, 2025 -
[V1] reduce block size for tree attention correctness test to fix 'ou…
#22207 merged
Aug 5, 2025 -
Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling."
#22223 merged
Aug 5, 2025 -
[Bugfix] V1 Fix the cursor leakage issue during request scheduling.
#21173 merged
Aug 5, 2025 -
[NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading
#22073 merged
Aug 5, 2025 -
[Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2pNcclConnector
#21819 merged
Aug 4, 2025 -
[Bug] Update auto_tune.sh to separate benchmarking and profiling.
#21629 merged
Aug 4, 2025 -
[Responses API] Ignore store=True and process the request by default
#22185 merged
Aug 4, 2025 -
Fix Arcee model weight loading: Add custom load_weights
#21725 merged
Aug 4, 2025 -
[Doc] Update pooling model docs
#22186 merged
Aug 4, 2025 -
[Sampler] Support returning all logprobs or logits
#21792 merged
Aug 4, 2025 -
[Bugfix] Fix failing GGUF models test
#22174 merged
Aug 4, 2025 -
[feat] move WEIGHT_SCALE_SUPPORTED into raise block to accelerate RLHF weight loading
#21164 merged
Aug 4, 2025 -
[Misc] Modify the organization of GLM series
#22171 merged
Aug 4, 2025 -
[CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes
#22163 merged
Aug 4, 2025 -
Remove index_put from MM embeddings merging
#22105 merged
Aug 4, 2025 -
[refactor] improve ConstantList exception specificity
#22156 merged
Aug 4, 2025 -
Add tree attention backend for v1 (part 1)
#20401 merged
Aug 4, 2025 -
remove duplicate code within cleanup_dist_env_and_memory
#22147 merged
Aug 4, 2025 -
[PD] add test for chat completions endpoint
#21925 merged
Aug 4, 2025 -
[RLHF] Fix torch.dtype not serializable in example
#22158 merged
Aug 4, 2025 -
[fix] fix correct assertion syntax error in attention utils.
#22154 merged
Aug 4, 2025 -
Use aiohttp connection pool for benchmarking
#21981 merged
Aug 4, 2025
123 Pull requests opened by 97 people
-
[Misc] add replicaid to ray metrics
#22159 opened
Aug 4, 2025 -
[Misc] Minor fixes and cleanups for elastic EP
#22160 opened
Aug 4, 2025 -
[Bugfix] Support full cuda graph with sliding window attention
#22168 opened
Aug 4, 2025 -
[Model][V1] Support Ernie MTP
#22169 opened
Aug 4, 2025 -
[Bugfix] Fix erroneous randomly generated cases in bad word testing
#22170 opened
Aug 4, 2025 -
Enable multi-stream for Llama4 q/k_norm and MoE
#22175 opened
Aug 4, 2025 -
[Performance] EPLB Execution Optimization
#22179 opened
Aug 4, 2025 -
enable dp for custom devices on ray executor.
#22181 opened
Aug 4, 2025 -
[P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy
#22188 opened
Aug 4, 2025 -
[Model] Mamba models - Support FP32 SSM cache
#22196 opened
Aug 4, 2025 -
[Misc] Update HunYuan dense model test
#22200 opened
Aug 4, 2025 -
[Misc] Improve Worker process title
#22205 opened
Aug 4, 2025 -
[Bugfix] fix hash error for chunked local attention hybrid KV
#22209 opened
Aug 4, 2025 -
[Perf] Support topk softmax fused kernel for broader num_experts
#22211 opened
Aug 4, 2025 -
feat: Add native support for XLM-RoBERTa embedding and BAAI/bge-reranker-v2-m3
#22216 opened
Aug 4, 2025 -
[Frontend] Added parallel_tool_calls option on the openai API with Guided Decoding
#22218 opened
Aug 4, 2025 -
[Quant] Refactor CompressedTensorsConfig
#22219 opened
Aug 4, 2025 -
Fp8 paged attention update
#22222 opened
Aug 5, 2025 -
[Bugfix] Disable the statslogger if the api_server_count is greater than 1
#22227 opened
Aug 5, 2025 -
[Perf][Feat][Core] Workload-Aware KVCache Eviction Policy
#22236 opened
Aug 5, 2025 -
fix(worker): adjust memory requirement calculation for GPU worker
#22237 opened
Aug 5, 2025 -
[Platform] allow platform to init dp group
#22243 opened
Aug 5, 2025 -
[Misc] benchmark_moe supports expert parallel
#22251 opened
Aug 5, 2025 -
[BugFix] Fix port lookup in internal DP LB tests
#22252 opened
Aug 5, 2025 -
[TPU][Misc] Fix TPU.device_name
#22254 opened
Aug 5, 2025 -
[Misc] Use comma-separated string for --kernels to avoid greedy consumption of the subparser command
#22258 opened
Aug 5, 2025 -
Support gpt-oss
#22259 opened
Aug 5, 2025 -
[Fix] apply_temperature() causing `inf` logits
#22261 opened
Aug 5, 2025 -
[V1][Spec Decode] Async scheduling integration with spec decode
#22262 opened
Aug 5, 2025 -
Il tool compare tool
#22263 opened
Aug 5, 2025 -
Support conditional torch.compile per module
#22269 opened
Aug 5, 2025 -
`NixlConnector` Support HTTP/S metadata exchange instead of zmq
#22274 opened
Aug 5, 2025 -
[Frontend] Add chunked processing to handle long inputs in embedding models
#22280 opened
Aug 5, 2025 -
Only convert output to weakref for last graph across all compilation units
#22282 opened
Aug 5, 2025 -
[Core] Return final response for aborted requests from `AsyncLLM.generate`
#22283 opened
Aug 5, 2025 -
[Feat] Allow custom `comm_group` in ParallelLinear layers
#22309 opened
Aug 6, 2025 -
Removing redundant installation of torch for CPU build
#22318 opened
Aug 6, 2025 -
[V1] Test cases for parallel sampling with all output_kind options
#22321 opened
Aug 6, 2025 -
[Not Merged] gpt-oss rc1 rebased
#22328 opened
Aug 6, 2025 -
fix(cuda): add arch 8.9 to support NVIDIA L4 GPU
#22343 opened
Aug 6, 2025 -
[Model]Force use triton compressed_tensor_moe instead of cutlass
#22345 opened
Aug 6, 2025 -
[Kernel] Add nvfp4 gemm flashinfer backends
#22346 opened
Aug 6, 2025 -
[Model] NemotronH Support
#22349 opened
Aug 6, 2025 -
[Bugfix] Simulate mxfp4 quark model execution on cdna4 until kernels are integrated
#22355 opened
Aug 6, 2025 -
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend
#22357 opened
Aug 6, 2025 -
[Bugfix] fixed the fp8liner init
#22360 opened
Aug 6, 2025 -
Improved defaulting of chunked prefill and prefix caching in V1
#22362 opened
Aug 6, 2025 -
Layered Dockerfile for smaller size and faster image pulling
#22377 opened
Aug 6, 2025 -
[gpt-oss] tool parser supports for /chat/completions [1/n]
#22386 opened
Aug 6, 2025 -
[Sampler] Support returning final logprobs
#22387 opened
Aug 6, 2025 -
N gram
#22390 opened
Aug 6, 2025 -
Ability to use custom-all-reduce on systems with more than 2 PCIe GPUs via env var
#22392 opened
Aug 6, 2025 -
[BugFix] Don't match ScaledMM patterns unless torch._scaled_mm is available
#22400 opened
Aug 6, 2025 -
[Metrics] Expose num_requests_waiting_by_priority metric
#22404 opened
Aug 6, 2025 -
[feature] add all_reduce registry for custom backends
#22406 opened
Aug 6, 2025 -
[XPU] support ray distribute executor on XPU
#22413 opened
Aug 7, 2025 -
WIP - Hack xpu memory - my attempt at fixing issue #20743
#22415 opened
Aug 7, 2025 -
[Misc] Add pre-commit __init__.py check
#22418 opened
Aug 7, 2025 -
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel
#22428 opened
Aug 7, 2025 -
[Bugfix] Bypass `is_kv_cache_type_uniform check` when kvcache is disabled
#22429 opened
Aug 7, 2025 -
[gpt-oss] Support streaming in response API
#22431 opened
Aug 7, 2025 -
[XPU][P/D] Add XPU support in NixlConnector
#22436 opened
Aug 7, 2025 -
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP
#22437 opened
Aug 7, 2025 -
[Speculators][Speculative Decoding] Add Eagle3 support for Qwen2
#22438 opened
Aug 7, 2025 -
Fixed vllm build and runtime openai tests on ppc64le
#22443 opened
Aug 7, 2025 -
[Model] use autoWeightsLoader for gptoss
#22446 opened
Aug 7, 2025 -
support silu+nvfp4 quant fusion
#22448 opened
Aug 7, 2025 -
[Bugfix] Added more env vars to hash
#22449 opened
Aug 7, 2025 -
Fix nvfp4 swizzling
#22450 opened
Aug 7, 2025 -
[V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` implementations
#22456 opened
Aug 7, 2025 -
[Feature] add procese set cpu affinity current gpu device
#22461 opened
Aug 7, 2025 -
[Quantization]: Support compressed-tensors mixed-precision model loading
#22468 opened
Aug 7, 2025 -
vllm fix check on max vocab size
#22471 opened
Aug 7, 2025 -
[V1][P/D]Bug fix: handle edge case where KVConnectorOutput is None
#22473 opened
Aug 7, 2025 -
[Refactor] Refactor FP8 & INT8 Quant Folder inside `8bit`
#22474 opened
Aug 7, 2025 -
[Attention] FA3 Attention Sinks Perf Boost
#22478 opened
Aug 8, 2025 -
[Structured Output] Make the output of structured output example more complete
#22481 opened
Aug 8, 2025 -
[Transform] [Quantization] Add transforms to compressed tensors
#22486 opened
Aug 8, 2025 -
Feat/sliding window metrics — Related to #22480
#22488 opened
Aug 8, 2025 -
consistency between the test and final Docker image
#22490 opened
Aug 8, 2025 -
[CI] Add end-to-end V1 min_tokens test coverage
#22495 opened
Aug 8, 2025 -
[Debugging] Add annotation for easier trace analysis
#22496 opened
Aug 8, 2025 -
[XPU] Fix OOM issue for data parallel with Ray backend
#22500 opened
Aug 8, 2025 -
[Platform] Custom ops update
#22509 opened
Aug 8, 2025 -
Fix Llama4 FlashInfer FP4 MoE issues
#22511 opened
Aug 8, 2025 -
[gpt-oss] Small bug fixes for frontend
#22512 opened
Aug 8, 2025 -
[WIP][Model] Add Ernie4.5 VL Model Support
#22514 opened
Aug 8, 2025 -
code clean when clean shm
#22516 opened
Aug 8, 2025 -
[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module.
#22521 opened
Aug 8, 2025 -
[WIP] [Bench] Add Triton NVFP4 GEMM
#22523 opened
Aug 8, 2025 -
[Fix] fix offline env use local mode path
#22526 opened
Aug 8, 2025 -
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs
#22527 opened
Aug 8, 2025 -
Fix torch version check for mxfp4
#22535 opened
Aug 8, 2025 -
New moe quant config
#22537 opened
Aug 8, 2025 -
[Kernel] Add cuda kernel for gpt_oss activation
#22538 opened
Aug 8, 2025 -
add tg-mxfp4-moe-test
#22540 opened
Aug 8, 2025 -
[WIP, Do not review] DI
#22542 opened
Aug 8, 2025 -
[Bugfix][V1] Fix Finished Request Handling in Async Scheduling
#22543 opened
Aug 8, 2025 -
[V1][Hybrid][Backend] Allow different data types in Hybrid Cache Manager
#22544 opened
Aug 8, 2025 -
[Misc] fail fast when exception is raised in in_the_same_node_as
#22553 opened
Aug 8, 2025 -
[gpt-oss] Add test for response API + harmony (but skipped)
#22554 opened
Aug 9, 2025 -
[BugFix] EAGLE Load Bias From Config
#22558 opened
Aug 9, 2025 -
Frontend: Adding LM Format Enforcer support to V1 engine
#22564 opened
Aug 9, 2025 -
[Core] Use individual MM items in P0/P1 cache and model runner
#22570 opened
Aug 9, 2025 -
optimize: improve scheduler policy lookup performance
#22573 opened
Aug 9, 2025 -
[Core][BugFix] Fix thread safety issue in RequestOutputCollector
#22576 opened
Aug 9, 2025 -
Fix Ray placement group allocation is not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional gpu)
#22577 opened
Aug 10, 2025 -
[CI/Build] Fix tensorizer test for load_format change
#22583 opened
Aug 10, 2025 -
[Misc][gpt-oss] guard import when triton kernel when not up to date
#22584 opened
Aug 10, 2025 -
Add return_token_ids_alongside parameter to OpenAI API endpoints
#22587 opened
Aug 10, 2025 -
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models
#22589 opened
Aug 10, 2025 -
[BugFix] Fix logits repetition penalty cuda check
#22592 opened
Aug 10, 2025 -
[V1] [Hybrid] Enable Full CUDA graph by default for models with mamba2 layers in V1
#22594 opened
Aug 10, 2025 -
v1: Offloading connector
#22595 opened
Aug 10, 2025 -
[BugFix] Fix KVConnectorOutput TPU breakage
#22598 opened
Aug 10, 2025 -
[Feature] Improve logging for error messages
#22599 opened
Aug 10, 2025 -
[Misc][gpt-oss] Add rules to label gpt-oss related PRs
#22600 opened
Aug 10, 2025 -
[Docs] Add comprehensive CLI reference for all large `vllm` subcommands
#22601 opened
Aug 10, 2025 -
Vectorize RMSNorm CUDA kernel
#22602 opened
Aug 10, 2025 -
minor: zero workspace buffer init for flashinfer trtllm-gen attn
#22603 opened
Aug 10, 2025
172 Issues closed by 37 people
-
[Feature]: Serving OpenAI model in bf16
#22277 closed
Aug 10, 2025 -
[Doc]:
#22582 closed
Aug 10, 2025 -
[Docs] Feedback for `/en/latest/examples/offline_inference/async_llm_streaming.html`
#22581 closed
Aug 10, 2025 -
[Docs] Feedback for `/en/latest/examples/offline_inference/async_llm_streaming.html`
#22580 closed
Aug 10, 2025 -
[Bug]: No output / Repeated outputs when using Gemma 3 on vLLM
#20341 closed
Aug 10, 2025 -
AttributeError: 'Gemma3TextConfig' object has no attribute 'interleaved_sliding_window'
#22270 closed
Aug 10, 2025 -
[Bug]: Major issues with transformers version causing rubbish generations with Gemma3 family using vllm
#22475 closed
Aug 10, 2025 -
[Bug]: Gemma-3-12B-it model getting stuck in repetitive output loops
#15752 closed
Aug 10, 2025 -
[Bug]: gemma3 shows degraded accuracy in vLLM v0.8.4
#17689 closed
Aug 10, 2025 -
[Bug]: Worker VllmWorkerProcess pid 000000 died, exit code: -15
#15295 closed
Aug 10, 2025 -
[Bug]: Stuck When Launching Llama-4-Maverick-17B-128E-Instruct-FP8
#16152 closed
Aug 10, 2025 -
[Usage]: Multiple Models on Same Port
#16232 closed
Aug 10, 2025 -
[Performance]: FP8 does not demonstrate an inference speed superior to that of FP16
#16261 closed
Aug 10, 2025 -
[Performance]: H100 Optimisation Configuration For Offline Inferencing
#16265 closed
Aug 10, 2025 -
[Usage]: how to redirect save logs to local file.
#16319 closed
Aug 10, 2025 -
[Bug]: Used VRAM is less than GPU memory utilization on a heterogeneous setup
#16382 closed
Aug 10, 2025 -
[Usage]: --kv-transfer-config is not supported by the V1 Engine. Falling back to V0.
#16395 closed
Aug 10, 2025 -
[Bug]: corrupted double-linked list (not small) Aborted
#16412 closed
Aug 10, 2025 -
[Feature]: Refactor config.py for non-cuda backend
#16468 closed
Aug 10, 2025 -
[Usage]: How to implement Streaming Input
#16469 closed
Aug 10, 2025 -
[Usage]: failed to run PD LMCache example code
#16471 closed
Aug 10, 2025 -
[Bug]: Medusa speculation hangs when tp > 1
#16477 closed
Aug 10, 2025 -
[Usage]: Inferencing with DeepSeek R1
#16498 closed
Aug 10, 2025 -
[New Model]: Gemma 3n support
#18476 closed
Aug 9, 2025 -
Generate nothing from VLLM output
#1185 closed
Aug 9, 2025 -
[Bug]: LLMEngine.add_request can't handle erroneous type of request_id
#19588 closed
Aug 9, 2025 -
[CI Failure]: test_hybrid.py::test_models undefined symbol: _ZN3c104cuda9SetDeviceEab
#22395 closed
Aug 9, 2025 -
[Usage]: vLLM support for FP8 models (QWEN3 FP8) on RTX 50 series / SM120
#21648 closed
Aug 9, 2025 -
[Bug]: ValueError:Could not broadcast input array from shape (542,) into shape (512,)
#9963 closed
Aug 9, 2025 -
[Usage]: `Phi-4-multimodal-instruct` activate LoRA module but get mangled text output
#15440 closed
Aug 9, 2025 -
[Bug]: Unable to run Phi4 with tensor-parallel-size 4 torch.compile compatibility
#16021 closed
Aug 9, 2025 -
[RFC]: vllm 0.10.1 gpt-oss version, will speculative sampling be used by default?
#22549 closed
Aug 8, 2025 -
[Feature]: support binding on Unix Domain Sockets (UDS)
#13907 closed
Aug 8, 2025 -
[RFC]: vllm 0.10.1 gpt-oss version, will speculative sampling be used by default?
#22547 closed
Aug 8, 2025 -
[Usage]:
#22552 closed
Aug 8, 2025 -
[RFC]:
#22548 closed
Aug 8, 2025 -
[RFC]: vllm 0.10.1 gpt-oss, will speculative sampling be used?
#22550 closed
Aug 8, 2025 -
[Bug]: Dependencies versions not pinned in tools/ep_kernels/install_python_libraries.sh
#22467 closed
Aug 8, 2025 -
[CI Failure]: V1 Test
#22172 closed
Aug 8, 2025 -
[Bug]: Smollm3M not working anymore
#22517 closed
Aug 8, 2025 -
[Usage]: How to use vLLM for cross-modal offline inference
#19429 closed
Aug 8, 2025 -
[Bug]: vLLM serve `google/gemma-3-1b-it` with version `0.8.5` interrupted `SIGTERM`
#17386 closed
Aug 8, 2025 -
[Bug]: Problem deploying Gemma-3-27b with vLLM
#16380 closed
Aug 8, 2025 -
[Bug]: gemma 3 structured output api occurs assertion error
#15766 closed
Aug 8, 2025 -
[Bug]: vllm 0.8.3 serve error
#15457 closed
Aug 8, 2025 -
[Bug]: Gemma-3-27b-it-GPTQ Can't run in sm75, vllm-0.7.4.dev
#14814 closed
Aug 8, 2025 -
[Feature]: gemma3 raise error
#14723 closed
Aug 8, 2025 -
[Bug]: reasoning parser for Qwen/Qwen3-4B-Thinking-2507
#22507 closed
Aug 8, 2025 -
[Bug]: Startup hangs on V100
#22503 closed
Aug 8, 2025 -
[Bug]: OSS-20B [backend_xgrammar.py:160] Failed to advance FSM
#22483 closed
Aug 8, 2025 -
[Bug]: RuntimeError: Index put requires the source and destination dtypes match
#22064 closed
Aug 8, 2025 -
[Bug]: EPLB load statistics problem
#21883 closed
Aug 8, 2025 -
[Bug]: Not support MiniCPM-o 2.6 ‘s finetune lora
#13018 closed
Aug 8, 2025 -
[Feature]: deepseek-r1-w8a8
#14176 closed
Aug 8, 2025 -
[Bug]: Speculative decoding with a draft model makes generation slower
#15025 closed
Aug 8, 2025 -
DeciLMConfig object has no attribute ‘num_key_value_heads_per_layer’
#15625 closed
Aug 8, 2025 -
[Bug]: Compiling for CPU fails due to wrong python setup call
#15953 closed
Aug 8, 2025 -
[Usage]: Llama-3.1-8B-Instruct
#16207 closed
Aug 8, 2025 -
[New Model]: LGAI-EXAONE/EXAONE-Deep-7.8B GGUF
#16224 closed
Aug 8, 2025 -
[Bug]: (Maybe) Input preprocessing blocks the async operations
#17883 closed
Aug 8, 2025 -
[Usage]: ERROR:root:Compiled DAG task exited with exception
#16242 closed
Aug 8, 2025 -
[Usage]: Failed to get global TPU topology.
#16243 closed
Aug 8, 2025 -
[New Model]: efficient-speech/lite-whisper-large-v3
#16244 closed
Aug 8, 2025 -
[Usage]: Async generate with offline LLM interface
#16251 closed
Aug 8, 2025 -
[Usage]: How to use xPyD disaggregated prefilling
#16253 closed
Aug 8, 2025 -
[Feature]: Will you add padding for intermediate_size just like lmdeploy?
#16260 closed
Aug 8, 2025 -
[Feature]: ray logs too large
#16262 closed
Aug 8, 2025 -
[Bug]: invalid responses when generating yaml format
#16269 closed
Aug 8, 2025 -
[Usage]: Force output for log probabilities
#16281 closed
Aug 8, 2025 -
[Bug]: Deepseek R1 gives nonsensical tokens during offline Inference
#16285 closed
Aug 8, 2025 -
[Performance]: Performance degradation with tp=8 compared to tp=4 on 8xA100(80G)
#16300 closed
Aug 8, 2025 -
[Usage]: When performing inference with vLLM, it keeps getting stuck at 0%.
#16303 closed
Aug 8, 2025 -
[Usage]: OpenAI client to vllm with lora
#16304 closed
Aug 8, 2025 -
[Usage]: VocabParallelEmbedding raise error embedding: [-130., -130., .....]
#16305 closed
Aug 8, 2025 -
[Feature]: Add task perplexity mode to optimize PPL evaluation
#16324 closed
Aug 8, 2025 -
[Usage]: it Says not able to read the model architecture
#16326 closed
Aug 8, 2025 -
[Feature]: Soft Prompts?
#16351 closed
Aug 8, 2025 -
[Usage]: vllm serve: how to get KV cache percentage on CPU and GPU?
#16368 closed
Aug 8, 2025 -
[Bug]: vllm server bug with qwen 2.5 gguf
#16374 closed
Aug 8, 2025 -
[Bug]: ROCm NotImplementedError: Speculative decoding is not yet supported on vLLM V1
#21308 closed
Aug 8, 2025 -
[Bug]: Flashinfer 0.2.10 not supported
#22455 closed
Aug 7, 2025 -
[Feature]: Support attention backend with FlexAttention
#7315 closed
Aug 7, 2025 -
[Bug]: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'num_splits'
#20556 closed
Aug 7, 2025 -
[RFC]: How to use OpenAI's "Reasoning Levels" (low/medium/high) in vLLM?
#22359 closed
Aug 7, 2025 -
[Bug]: Incorrect version judgment method for flashinfer
#22297 closed
Aug 7, 2025 -
[feat] vLLM generation deterministic option/flag
#2910 closed
Aug 7, 2025 -
ExLlamaV2: exl2 support
#3203 closed
Aug 7, 2025 -
[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
#8024 closed
Aug 7, 2025 -
[Bug]: gpu_memory_utilization affects generation quality
#15763 closed
Aug 7, 2025 -
[Bug]: benchmark a vllm cluster error
#22351 closed
Aug 7, 2025 -
[Bug]: docker image vllm/vllm-openai:gptoss can not load gpt-oss-120b on H100. OutOfMemoryError
#22358 closed
Aug 7, 2025 -
[Bug]: gpt oss 401 unauthorized error
#22407 closed
Aug 6, 2025 -
[Bug]: ModelConfig validation error when serving chatgpt-oss models
#22301 closed
Aug 6, 2025 -
[Bug]:
#22304 closed
Aug 6, 2025 -
[Bug]:
#22303 closed
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP failed
#22306 closed
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP failed
#22305 closed
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP failed
#22302 closed
Aug 6, 2025 -
[Bug]: vllm will return empty string when request.stop is not null
#11089 closed
Aug 6, 2025 -
[RFC]: BatchLLM for better shared prefix utilizing in offline scenarios
#12080 closed
Aug 6, 2025 -
[Bug]: deepseek r1 + vllm (v0.7.2) torch.compile error
#13471 closed
Aug 6, 2025 -
[Feature]: DeepSeek v3/r1 MTP support PP
#14005 closed
Aug 6, 2025 -
[Bug][V1]: Loading Llama3.1-8B-INT8 gets OOM when using VLLM_USE_v1=1 but safe using v0
#14286 closed
Aug 6, 2025 -
[Bug]: Once again, there is an imbalance in memory usage in pipeline parallel deployment
#14763 closed
Aug 6, 2025 -
[Usage]: Clarification on how to use Greedy Search and then Beam search's Poor Performance in VLLM
#15146 closed
Aug 6, 2025 -
[Bug]: Missing `Xformers` on `rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250311` for pixtral-12b-2409
#15972 closed
Aug 6, 2025 -
[Usage]: How to run multiple models in the same docker instance
#16065 closed
Aug 6, 2025 -
[Bug]: Llama4 does not follow basic instructions properly
#16125 closed
Aug 6, 2025 -
[Bug]: memory usage is greater than expected
#16184 closed
Aug 6, 2025 -
[Installation]: the --mount option requires BuildKit
#16205 closed
Aug 6, 2025 -
[Bug]: RMSNorm not checking for input shape in forward_cuda
#16329 closed
Aug 6, 2025 -
[CI Failure]: Plugin Tests (2 GPUs) - plugins_tests/test_platform_plugins.py::test_oot_attention_backend
#22285 closed
Aug 6, 2025 -
[Feature]: Support for Qwen/Qwen-Image
#22212 closed
Aug 5, 2025 -
[Bug]: when tensor-parallel-size>1,Stuck
#8087 closed
Aug 5, 2025 -
[Bug]: Can Not load model Qwen2-VL-72B-Instruct in Vllm
#11608 closed
Aug 5, 2025 -
[Bug]: vLLM gets stuck at pynccl.py during startup
#12292 closed
Aug 5, 2025 -
[Bug]: different logprobs for qwn2-vl when running on transformers and on vllm
#12699 closed
Aug 5, 2025 -
[V1] [Performance Benchmark] Benchmark the performance of Speculative Decoding
#15600 closed
Aug 5, 2025 -
[Bug]: OpenAI-Compatible Server cannot be requested
#15675 closed
Aug 5, 2025 -
[Bug]: Worker died during distributed inference
#15687 closed
Aug 5, 2025 -
[Bug]: Null response for Mistral3.1
#16014 closed
Aug 5, 2025 -
[Bug]: allow input token logprobs output for multimodal/VLLM
#16107 closed
Aug 5, 2025 -
[Usage]: Asking for help: vllm0.7.2 deploy DeepSeek-R1-int4-gptq-sym-inc
#16111 closed
Aug 5, 2025 -
[Feature]: Support loading GGUF from a remote repo directly
#22210 closed
Aug 4, 2025 -
[Doc]: docs opening speed is much slower than before
#22182 closed
Aug 4, 2025 -
[Bug]: vllm ignores existing pytorch installation
#21745 closed
Aug 4, 2025 -
[CI Failure]: Quantized Models Test - models/quantization/test_gguf.py::test_models
#22136 closed
Aug 4, 2025 -
[Bug]: can not find model error when use docker deploy
#22164 closed
Aug 4, 2025 -
[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication
#9519 closed
Aug 4, 2025 -
[Bug]: Issue with extra [TOOL_CALLS] prefix in function call outputs in chat_with_tools.py
#13473 closed
Aug 4, 2025 -
[Bug]: Gibberish Output from LLaMA 3.1 8B using vLLM with xGrammar
#13828 closed
Aug 4, 2025 -
[New Model]: No supported config format found in deepseek-vl2-small
#14105 closed
Aug 4, 2025 -
Does VLLM support structured pruning?
#15854 closed
Aug 4, 2025 -
[Bug]: v1 0.8.2 gets weird performance results
#15947 closed
Aug 4, 2025 -
[Bug]: TypeError: __init__() missing 1 required positional argument: 'inner_exception'
#16009 closed
Aug 4, 2025 -
[Doc]: Steps to run 2 different models on Kaggle GPUs using vllm
#16051 closed
Aug 4, 2025 -
[Bug]: CI flake - v1/engine/test_async_llm.py::test_abort - assert has_unfinished_requests()
#16054 closed
Aug 4, 2025 -
[Bug]: KeyError: 'local_attn_masks' on running gemma3 models with kv-cache quantization
#16061 closed
Aug 4, 2025 -
[Bug]: vllm server stops after torch.compile step for multi-gpu setup
#16092 closed
Aug 4, 2025 -
[Feature]: Will beam search utilize CUDA graphs to improve its speed in the future?
#16099 closed
Aug 4, 2025 -
[Feature]: wheel for vllm swiftkv
#16108 closed
Aug 4, 2025 -
[Doc]: Syntax error of example code in structured_outputs.md
#21914 closed
Aug 4, 2025
115 Issues opened by 109 people
-
[Feature]: gpt-oss tool parser
#22604 opened
Aug 10, 2025 -
[Bug]: CPU penalty operations fail on CUDA-capable systems
#22591 opened
Aug 10, 2025 -
[Bug]: ROCm build falls back to default arch despite ARG_PYTORCH_ROCM_ARCH set in Dockerfile.rocm
#22590 opened
Aug 10, 2025 -
[Bug]: [gpt-oss-120b] Chat Completions endpoint tool_call support is not working
#22578 opened
Aug 10, 2025 -
[Bug]: Vllm hangs when I use the offline engine with dp = 2 or more
#22575 opened
Aug 9, 2025 -
[Bug]: Llama3.1 8B failing to load, _tensor has no operation split, v0.9.2 & v0.10.0
#22572 opened
Aug 9, 2025 -
[Feature]: subfolder parameter for EngineArgs
#22569 opened
Aug 9, 2025 -
[Usage]: VRAM spike while loading gemma3-12b bnb on vllm-0.10
#22568 opened
Aug 9, 2025 -
[CI Failure]: Distributed Tests (2 GPUs) - Mllama TP=2 results divergence and deadlock issue
#22559 opened
Aug 9, 2025 -
[Usage]: vllm 0.10.1 gpt-oss version, will speculative sampling be used by default?
#22551 opened
Aug 8, 2025 -
[Bug]: Stats don't update to zero when all requests are aborted
#22545 opened
Aug 8, 2025 -
[Bug]: gpt-oss-20b flaky BadRequest 400
#22533 opened
Aug 8, 2025 -
[Bug]: NIXL disaggregation example does not work
#22532 opened
Aug 8, 2025 -
[Usage]: How to choose max_num_batched_tokens when chunked prefill is enabled
#22531 opened
Aug 8, 2025 -
[Bug]: (gpt-oss-20b) openai_harmony.HarmonyError: error downloading or loading vocab file
#22525 opened
Aug 8, 2025 -
[Usage]: Custom function-based stopping criteria
#22522 opened
Aug 8, 2025 -
[Bug]: [gpt oss 20b] [tool_call] Unexpected token 12606 while expecting start token 200006
#22519 opened
Aug 8, 2025 -
[Bug]: Fine-tuned DeepSeek-R1-Distill-Qwen-1.5B generates only exclamation marks (token 0) on Ascend NPU
#22518 opened
Aug 8, 2025 -
[Bug]: GPT-OSS 20b/120b [backend_xgrammar.py:160] Failed to advance FSM for request
#22513 opened
Aug 8, 2025 -
qwen2.5_omni_7b response audio
#22510 opened
Aug 8, 2025 -
[Usage]: vllm failed to run on multiple gpu
#22505 opened
Aug 8, 2025 -
is support for CohereLabs/command-a-vision-07-2025 available?
#22504 opened
Aug 8, 2025 -
[Bug]: openai/gpt-oss-120b can't run on A100
#22502 opened
Aug 8, 2025 -
[Usage]: Running a 300-400B Parameter Model on Multi-Node Setup (2x 8xA100)
#22501 opened
Aug 8, 2025 -
[Bug]: Crashed when loading ggml quantized Gemma3
#22497 opened
Aug 8, 2025 -
[RFC]: Should the gpt-oss reasoning parser use harmony directly?
#22493 opened
Aug 8, 2025 -
[Bug]: HF_HUB_OFFLINE Parameter does not take effect
#22492 opened
Aug 8, 2025 -
[Bug]: With the same model qwen2.5-72b-int4, outputs differ between single-GPU and dual-GPU runs
#22484 opened
Aug 8, 2025 -
[Feature]: Add Moving Average Statistics for Better Performance Monitoring
#22480 opened
Aug 8, 2025 -
[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10
#22479 opened
Aug 8, 2025 -
[Bug]: Trailing newline in prompt affects output
#22477 opened
Aug 8, 2025 -
[Performance]: Very low prefill speed on CPU with BART (encoder-decoder)
#22472 opened
Aug 7, 2025 -
[Bug]: gpt oss 20/120b generates weird characters and fails later when I use them
#22470 opened
Aug 7, 2025 -
[Usage]: can't seem to use batching correctly
#22465 opened
Aug 7, 2025 -
[Feature]: Please add support for Amd Instinct MI50
#22458 opened
Aug 7, 2025 -
[Bug]: Low time to first token of prefill and decode instances but high TTFT with 1p1d
#22454 opened
Aug 7, 2025 -
[Bug]: Qwen3 V1 custom rotary embedding ops breaks torch compile graph
#22453 opened
Aug 7, 2025 -
[Doc]: Data Parallel script gives `RuntimeError: NCCL error`
#22452 opened
Aug 7, 2025 -
[Usage]: Can runtime static values be retrieved for `AsyncLLM`
#22447 opened
Aug 7, 2025 -
[Feature]: Support Qwen model on AWS Neuron
#22445 opened
Aug 7, 2025 -
[Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings
#22444 opened
Aug 7, 2025 -
[Usage]: Whether to support benchmark service stress testing of embeddings model?
#22442 opened
Aug 7, 2025 -
[Feature]: Support Eagle Draft Model with different number of KV heads
#22432 opened
Aug 7, 2025 -
[Bug]: PP+PD NixlConnector failed
#22430 opened
Aug 7, 2025 -
[Bug]: Voxtral-Small-24B-2507 Does Not Support Pipeline-Parallel
#22424 opened
Aug 7, 2025 -
[Feature]: mxfp4 support for 3090
#22422 opened
Aug 7, 2025 -
[Feature]: Add BF16/U8 support for Apple silicon.
#22420 opened
Aug 7, 2025 -
[Bug]: Potential Integer Overflow in Operators
#22419 opened
Aug 7, 2025 -
[Bug]: mistralai/Mixtral-8x7B-Instruct-v0.1 will not load from local path
#22416 opened
Aug 7, 2025 -
[Performance]: n-gram speculative decoding drafting optimizations
#22408 opened
Aug 6, 2025 -
[Bug]: For GPT OSS 120B: Expected 2 output messages (reasoning and final), but got 7.
#22403 opened
Aug 6, 2025 -
[CI Failure]: Blackwell Test - toomanyrequests: Data limit exceeded
#22396 opened
Aug 6, 2025 -
[Bug]: TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'sinks'
#22383 opened
Aug 6, 2025 -
[Feature]: serve gpt-oss BF16 weights
#22380 opened
Aug 6, 2025 -
[Bug]: Model architectures ['GptOssForCausalLM'] failed to be inspected.
#22376 opened
Aug 6, 2025 -
[Bug]: process audios in pass audio in video with qwen2.5-omni-7b
#22364 opened
Aug 6, 2025 -
[Bug]: gpt-oss-120b does not support hybrid data parallel and tensor parallel
#22361 opened
Aug 6, 2025 -
[Bug]: gpt-oss-20b, oss architecture not recognised with transformers
#22353 opened
Aug 6, 2025 -
[Usage]: gpt-oss-120b tool calls
#22337 opened
Aug 6, 2025 -
[Feature]: add --reasoning_parser flag for gpt-oss
#22334 opened
Aug 6, 2025 -
[Bug]: vllm/vllm-openai:gptoss AssertionError: Sinks are only supported in FlashAttention 3 (4090 48gb)
#22331 opened
Aug 6, 2025 -
[Bug]: gpt-oss model crashes on NVIDIA B200 with any OpenAI chat completion request
#22325 opened
Aug 6, 2025 -
[Bug]: Error using openai's json_schema
#22324 opened
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP deployment failed
#22307 opened
Aug 6, 2025 -
[New Model]: https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B
#22298 opened
Aug 6, 2025 -
[Feature]: EP physical expert load write metrics
#22296 opened
Aug 6, 2025 -
[Feature]: Cleanup NIXLConnector Connections When Nodes Spin Down
#22295 opened
Aug 6, 2025 -
[Feature]: Tune Triton Configs for Qwen3-30B-A3-Fp8 and Bf16
#22294 opened
Aug 6, 2025 -
[Feature]: Optimize RoPE
#22293 opened
Aug 6, 2025 -
[Feature]: Make KVConnector Compatible with HMA
#22292 opened
Aug 6, 2025 -
[Bug]: gpt-oss on Ampere
#22290 opened
Aug 6, 2025 -
[Feature]: MoE DP support for AWQ/GPTQ
#22289 opened
Aug 6, 2025 -
[Bug]: The quantization method mxfp4 is not supported for the current GPU SM75
#22288 opened
Aug 5, 2025 -
[Bug]: gpt-oss-20b sometimes emits reserved tokens
#22287 opened
Aug 5, 2025 -
[Bug]: Unknown quantization method: mxfp4
#22276 opened
Aug 5, 2025 -
[Bug]: pass audios in videos in qwen2.5-omni-7b
#22268 opened
Aug 5, 2025 -
[Bug]: Running gpt-oss-120b encounteres an error with huggingface page setup
#22266 opened
Aug 5, 2025 -
[New Model]: OpenAI OSS model support
#22265 opened
Aug 5, 2025 -
[Bug]: I got an `torch._scaled_mm` error using async tp with Ampere GPU
#22250 opened
Aug 5, 2025 -
[Bug]: qwen3 moe fp8 perchannel compressed-tensors model cannot infer
#22249 opened
Aug 5, 2025 -
[RFC]: Kernel Bench Docker
#22247 opened
Aug 5, 2025 -
[RFC]: Dynamic Expert Load Balance with Zero-like-overhead
#22246 opened
Aug 5, 2025 -
[Performance]: vllm v0.10.0 seems to be much slower than vllm v0.8.5 when using Qwen3-30B-A3B-int4
#22239 opened
Aug 5, 2025 -
[Bug]: DeepSeekV3 type model fails with UnboundLocalError: cannot access local variable 'shared_output'
#22232 opened
Aug 5, 2025 -
[Usage]: Model architectures ['Qwen3ForCausalLM'] are not supported for now.
#22231 opened
Aug 5, 2025 -
[Feature]: Can speculative decoding and prefix caching take effect simultaneously?
#22230 opened
Aug 5, 2025 -
[Feature]: Qwen/Qwen3-4B-AWQ or RedHatAI/Qwen3-4B-quantized.w4a16 CPU support
#22213 opened
Aug 4, 2025 -
[Feature]: consider offering a linux/arm64 build on Docker Hub
#22206 opened
Aug 4, 2025 -
[RFC]: Integrate MPK (Mirage) compiler as an experimental execution backend to vLLM
#22201 opened
Aug 4, 2025 -
[Feature]: Support return partial rollout when abort request
#22197 opened
Aug 4, 2025 -
[Feature]: Support Mixed-Precision KV Cache Configuration
#22195 opened
Aug 4, 2025 -
[Feature]: Deterministic inference between vllm versions
#22191 opened
Aug 4, 2025 -
[Usage]: VLLM_USE_TRITON_FLASH_ATTN=0 does not enable CK flash attention
#22190 opened
Aug 4, 2025 -
[Feature]: support moe model in small gpu ram
#22189 opened
Aug 4, 2025 -
[Bug]: downgrade the wrong platform of pytorch version, should be ppc instead aarch64
#22183 opened
Aug 4, 2025 -
[Bug]: apply_temperature may cause nan in probs
#22180 opened
Aug 4, 2025 -
[Installation]: vllm 0.90 shows aimv2 error code need expert help
#22178 opened
Aug 4, 2025 -
[Usage]: Abnormal GPU usage for FP8 models on Ampere GPUs
#22177 opened
Aug 4, 2025 -
[Usage]: multi-lora for vision language model
#22162 opened
Aug 4, 2025 -
[Performance]: TTFT
#22161 opened
Aug 4, 2025
360 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Model] Pooling models default to using chunked prefill & prefix caching if supported.
#20930 commented on
Aug 10, 2025 • 44 new comments -
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel
#21716 commented on
Aug 9, 2025 • 39 new comments -
v1: Add Whisper model support (encoder-decoder)
#21088 commented on
Aug 10, 2025 • 19 new comments -
[Core] Shared memory based object store for Multimodal data caching and IPC
#20452 commented on
Aug 9, 2025 • 17 new comments -
[V1] Logits processors extensibility
#19912 commented on
Aug 7, 2025 • 14 new comments -
LFM2
#20797 commented on
Aug 9, 2025 • 13 new comments -
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods.
#22035 commented on
Aug 10, 2025 • 10 new comments -
Update PyTorch to 2.8.0
#20358 commented on
Aug 10, 2025 • 9 new comments -
[Feature] use --eplb_config to set eplb param
#20562 commented on
Aug 9, 2025 • 9 new comments -
[Bugfix] Fix hermes tool parser handling of non-string argument types
#22002 commented on
Aug 6, 2025 • 8 new comments -
Support token_type_ids in V1 with less code changes
#21985 commented on
Aug 10, 2025 • 8 new comments -
[Model] Mamba2 varlen and metadata refactor
#21467 commented on
Aug 4, 2025 • 8 new comments -
[Speculators][Speculative Decoding] Add Eagle3 Support For HunYuan Model
#22080 commented on
Aug 9, 2025 • 7 new comments -
feat: update flashinfer ar oneshot params
#22108 commented on
Aug 10, 2025 • 6 new comments -
[Doc]: improve CPU(x86) build instructions and fix include path
#19156 commented on
Aug 10, 2025 • 6 new comments -
v1: Add Request.block_hashes
#19728 commented on
Aug 10, 2025 • 6 new comments -
[V1] feat:add engine v1 tracing
#20372 commented on
Aug 7, 2025 • 5 new comments -
[Feature] limit thinking tokens
#20859 commented on
Aug 5, 2025 • 5 new comments -
fix(completion): always include usage
#20983 commented on
Aug 6, 2025 • 5 new comments -
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer
#20059 commented on
Aug 10, 2025 • 5 new comments -
[Bugfix] Add Dense module support for sentence-transformers models
#22117 commented on
Aug 5, 2025 • 4 new comments -
Fix error message for max_input_length (bugfix of #22092)
#22094 commented on
Aug 5, 2025 • 4 new comments -
Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_convert"
#21888 commented on
Aug 7, 2025 • 4 new comments -
[Feature] [V1] intermediate logging
#21215 commented on
Aug 8, 2025 • 3 new comments -
fix: NIXL connector transfers partial block to pass full multi-modal context
#21074 commented on
Aug 8, 2025 • 3 new comments -
[ROCm] Auto-Select Attention Backend
#21366 commented on
Aug 7, 2025 • 3 new comments -
Add support for model signature verification
#21957 commented on
Aug 7, 2025 • 3 new comments -
[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper
#19983 commented on
Aug 10, 2025 • 3 new comments -
Migrate MiniCPMVImageInputs to TensorSchema
#21939 commented on
Aug 10, 2025 • 3 new comments -
[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2
#21783 commented on
Aug 8, 2025 • 2 new comments -
Draft: Qwen2.5 VL eagle3
#22029 commented on
Aug 10, 2025 • 2 new comments -
[V1][Hybrid] Make KV cache layout of triton_attn compatible with hybrid models
#21624 commented on
Aug 5, 2025 • 2 new comments -
[1/N] Refactor platform API to reduce `torch.cuda` call
#20751 commented on
Aug 10, 2025 • 2 new comments -
[docs] Add FlashInfer installation guidance with torch 2.7 + cuda 12.8
#21890 commented on
Aug 5, 2025 • 1 new comment -
[Core] Hidden State Processors via plugins
#21621 commented on
Aug 7, 2025 • 1 new comment -
Migrate LlavaNextVideoPixelInputs to TensorSchema
#21843 commented on
Aug 10, 2025 • 1 new comment -
v1: Support KV events from connectors
#19737 commented on
Aug 10, 2025 • 1 new comment -
Migrate MiniMaxVL01ImageInputs to TensorSchema
#21940 commented on
Aug 7, 2025 • 1 new comment -
Add Tool Call Parser for tngtech/DeepSeek-TNG-R1T2-Chimera
#22074 commented on
Aug 4, 2025 • 1 new comment -
Qwen FP8 ModelOPT support
#21978 commented on
Aug 8, 2025 • 1 new comment -
[V1] support min_tokens for detokener
#22014 commented on
Aug 6, 2025 • 1 new comment -
[Bugfix]: Fix Promethus spec decode counter sum-of-sums
#15415 commented on
Aug 10, 2025 • 0 new comments -
[Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface.
#16096 commented on
Aug 9, 2025 • 0 new comments -
Switch from input to total toks/s in LLM est speed
#12908 commented on
Aug 8, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Aug 10, 2025 • 0 new comments -
Support R-KV Cache Compression in vLLM
#16160 commented on
Aug 8, 2025 • 0 new comments -
[P/D Disaggregation] `PDController` and `PDWorker` Prototype (1p1d)
#15343 commented on
Aug 4, 2025 • 0 new comments -
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
#15246 commented on
Aug 10, 2025 • 0 new comments -
[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898)
#12802 commented on
Aug 10, 2025 • 0 new comments -
[Bugfix] Mistral tool parser streaming update
#19425 commented on
Aug 8, 2025 • 0 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Aug 7, 2025 • 0 new comments -
[Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate` and `LLM.embed`
#18800 commented on
Aug 4, 2025 • 0 new comments -
[Bugfix] fix check kv cache memory log info
#17959 commented on
Aug 10, 2025 • 0 new comments -
[Misc][RFC] Add automated profiling sweep and heatmap visualization tools
#17933 commented on
Aug 10, 2025 • 0 new comments -
measure peak memory correctly by removing already used memory
#17872 commented on
Aug 10, 2025 • 0 new comments -
Fix NoFreeBlocksError
#17834 commented on
Aug 8, 2025 • 0 new comments -
cmake: Get rid of VLLM_PYTHON_EXECUTABLE
#17830 commented on
Aug 10, 2025 • 0 new comments -
[misc] helper for observability config
#17809 commented on
Aug 10, 2025 • 0 new comments -
Allow MambaCacheManager to use device types other than CUDA
#17779 commented on
Aug 8, 2025 • 0 new comments -
[Security] Document StatelessProcessGroup security concerns
#17591 commented on
Aug 5, 2025 • 0 new comments -
[Misc] add get kv cache token capacity
#17538 commented on
Aug 7, 2025 • 0 new comments -
[BUG] fix asymmetric `add_num_batched_tokens` and `subtract_num_batched_tokens`
#17436 commented on
Aug 5, 2025 • 0 new comments -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 commented on
Aug 5, 2025 • 0 new comments -
[Frontend] Added support for HermesToolParser for models without special tokens
#16890 commented on
Aug 4, 2025 • 0 new comments -
[Bugfix] Move current_platform import to avoid python import cache.
#16601 commented on
Aug 7, 2025 • 0 new comments -
[V1][Spec Decode] Add random seed for EAGLE and its test script
#16235 commented on
Aug 8, 2025 • 0 new comments -
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform
#12695 commented on
Aug 10, 2025 • 0 new comments -
[RFC]: Add automated profiling sweep and heatmap visualization tools
#17823 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Add Native Daemon Mode Support for `vllm serve`
#17847 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Token usage is unavailable in Qwen3 streaming thought mode.
#17942 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: hope that xgrammar and vLLM v1 can offer significant inference acceleration on the RTX 4090 as well
#18517 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
#17569 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: Error of running examples/offline_inference/multilora_inference.py using vllm v0.8.3
#16576 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: [Performance] 100% performance drop using multiple lora vs no lora(qwen-chat model)
#9496 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: vllm.core.block.interfaces.BlockAllocator.NoFreeBlocksError to old Mistral Model
#11168 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: GPU OOM when processing AWQ Marlin weights with UVA CPU Offload
#21864 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: Vllm0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
#8933 commented on
Aug 9, 2025 • 0 new comments -
[Performance]: Unexpected: B200 GPU Performance Similar to H200 for Qwen/QwQ-32B, Expected B200 to be Significantly Faster
#18725 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: Incremental detokenization error when running `llama-3.3-70b-fp8` model
#21951 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Aug 8, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Aug 8, 2025 • 0 new comments -
[RFC]: vLLM configuration refactoring and modularization
#18953 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Stuck request and empty streaming for gemma3 serving with ^v0.8.5
#17658 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Inference fails on Apple silicon due to (distributed) networking error?
#18362 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Sampling discrepancy between ollama and vLLM for gemma-3-27b-it et al.
#20060 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Gemma3 reporting low image accuracy with v1 engine
#19763 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: sm75 can not serve qwen3 bnb 4bit model
#17337 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Prefix caching ignores visual input, causing incorrect multimodal outputs under concurrency
#20261 commented on
Aug 8, 2025 • 0 new comments -
[CI] Fix flaky CI test
#12626 commented on
Aug 5, 2025 • 0 new comments -
[Frontend] Add segments to OpenAI Requests
#11713 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Add LoRA adapter support for Gemma3nForConditionalGeneration models
#21746 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Support Anthropic API `/v1/messages` endpoint
#21313 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Unexpected behavior of `returned_token_ids` in Reward Modeling for LlamaForCausalLM
#16545 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Aug 10, 2025 • 0 new comments -
[Installation]: Latest vLLM source installs successfully, but the vLLM server fails to run
#22008 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Support for diffusion LLM models like LLADA
#18532 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: the throughput of qwen3moe is low for prompts above 2000 tokens
#17650 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Use `QuantFp8` `CustomOp`-abstraction for MoE layers
#20711 commented on
Aug 10, 2025 • 0 new comments -
[Performance]: Inefficient prefill attention compared to HuggingFace
#20174 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: The value of --max-model-len may influence results even when the input length is less than max-model-len
#11447 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: v0.7.3 can't work on wsl-ubuntu mirrored network
#13656 commented on
Aug 10, 2025 • 0 new comments -
[New Model]: Google SigLip 2
#13663 commented on
Aug 10, 2025 • 0 new comments -
[Installation]: Attempting to build and run vLLM for Intel Core Ultra 7 155H with ARC iGPU
#14295 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: 0.8.0 (V1) Ray cannot find the pyarrow and pandas modules
#15100 commented on
Aug 10, 2025 • 0 new comments -
[Usage]: How to make DeepSeek output its reasoning normally while producing structured output for the final content
#15618 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Using TP = 16 to serve DeepSeek-V3 on 2*H20 in a Ray cluster raises an EngineCore exception
#16646 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: CPU memory not released when waking up the vLLM instance
#16663 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0
#17639 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: KeyError: 'layers.11.shared_transformer.self_attn.qkv_proj.weight' for Zamba2 after finetuning
#17755 commented on
Aug 10, 2025 • 0 new comments -
Migrate Mistral3ImagePixelInputs to TensorSchema
#21945 commented on
Aug 7, 2025 • 0 new comments -
[Bugfix] Ensure the system ulimit is high enough to support the concurrency in benchmark
#21938 commented on
Aug 4, 2025 • 0 new comments -
Fix #21840. Fix tool_call parsing edge cases
#21930 commented on
Aug 4, 2025 • 0 new comments -
[Bugfix] Fix port conflict by obtaining a list of open ports upfront
#21894 commented on
Aug 8, 2025 • 0 new comments -
[Bugfix] Fix PyNcclCommunicator device assertion for un-indexed CUDA devices
#21869 commented on
Aug 5, 2025 • 0 new comments -
[V0 Deprecation] [P/D] Move `kv_connector/v1` to `kv_connector` (2/2)
#21855 commented on
Aug 5, 2025 • 0 new comments -
[WIP] Add Kimi-Audio integration for vLLM
#21849 commented on
Aug 8, 2025 • 0 new comments -
Migrate MiniCPMOAudioInputs to TensorSchema
#21847 commented on
Aug 7, 2025 • 0 new comments -
Migrate LlavaOnevisionMultiInputs to TensorSchema
#21844 commented on
Aug 9, 2025 • 0 new comments -
Fix kvcache mismatch issue in vllm v0 kv_connector
#21817 commented on
Aug 5, 2025 • 0 new comments -
[Bugfix] Scheduler: only schedule prefill chunks when entire context fits
#21809 commented on
Aug 7, 2025 • 0 new comments -
Migrate LlavaImageInputs to TensorSchema
#21770 commented on
Aug 10, 2025 • 0 new comments -
[Doc] Added unmentioned required option "method" in the usage of EAGLE-3 based models
#21737 commented on
Aug 7, 2025 • 0 new comments -
[EP]dynamic Eplb metrics
#21732 commented on
Aug 4, 2025 • 0 new comments -
Limit concurrent long partial prefills via max_long_partial_prefills
#21651 commented on
Aug 8, 2025 • 0 new comments -
[V1] [Kernel] Change KV cache layout to (num_blocks, 2, ...) for FlashAttention backend
#21549 commented on
Aug 10, 2025 • 0 new comments -
[Perf] Support silu_and_mul vectorization
#21521 commented on
Aug 4, 2025 • 0 new comments -
[Benchmark] Add expert parallel support to MoE benchmark
#20876 commented on
Aug 4, 2025 • 0 new comments -
[V0 Deprecation] Remove multi-step scheduling
#22138 commented on
Aug 9, 2025 • 0 new comments -
Update rms_norm_kernel by removing redundant global memory loads
#22134 commented on
Aug 10, 2025 • 0 new comments -
[Bugfix] Add num_special_tokens_to_add to MistralTokenizer, fixes #22013
#22121 commented on
Aug 4, 2025 • 0 new comments -
vLLM Benchmark suite improvement
#22119 commented on
Aug 9, 2025 • 0 new comments -
[Fix] Fix python path resolving in cpu cmake
#22115 commented on
Aug 6, 2025 • 0 new comments -
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar
#22112 commented on
Aug 6, 2025 • 0 new comments -
enable Docker-aware precompiled wheel setup
#22106 commented on
Aug 6, 2025 • 0 new comments -
Enable EPLB on ernie4.5-moe
#22100 commented on
Aug 5, 2025 • 0 new comments -
[V1] Enhanced Exception Handling for KV Cache Loading from Remote Store
#22075 commented on
Aug 4, 2025 • 0 new comments -
[Core] Enable HF processing on GPU
#22070 commented on
Aug 9, 2025 • 0 new comments -
Update FlashAttention to latest commit
#22030 commented on
Aug 7, 2025 • 0 new comments -
[BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm
#22017 commented on
Aug 9, 2025 • 0 new comments -
[Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils
#21999 commented on
Aug 8, 2025 • 0 new comments -
[Disagg][Perf] Add env var to allow gpu model work runs in non-default CUDA stream, improving disagg TTIT/TTFT
#21988 commented on
Aug 7, 2025 • 0 new comments -
[torchao] Support quantization configs using module swap
#21982 commented on
Aug 9, 2025 • 0 new comments -
[CI/Build] add EP dependencies to docker
#21976 commented on
Aug 7, 2025 • 0 new comments -
[Build] Add FlashInfer wheel build to release pipeline
#21975 commented on
Aug 6, 2025 • 0 new comments -
[Feature] Add `VLLM_USE_DEEP_GEMM_E8M0` Env to Control E8M0 Scale
#21968 commented on
Aug 10, 2025 • 0 new comments -
[Frontend] OpenAI Responses API supports Tool/Function calling
#20874 commented on
Aug 7, 2025 • 0 new comments -
[Bugfix] Fix the bug in Hermes streaming parsing
#20824 commented on
Aug 5, 2025 • 0 new comments -
[WIP] [Feature]: LoRA for vision modules
#20787 commented on
Aug 4, 2025 • 0 new comments -
[PERF] Symmetric memory allreduce
#20759 commented on
Aug 7, 2025 • 0 new comments -
[feat] backup memory except model parameters when using level=2 in sleep mode
#20735 commented on
Aug 4, 2025 • 0 new comments -
PrefixRepetitionRandomDataset
#20638 commented on
Aug 8, 2025 • 0 new comments -
feat: Add streaming support for Mistral v11 tool format
#20503 commented on
Aug 6, 2025 • 0 new comments -
[Installation] Fix python only installation wheel packaging missing libs
#20351 commented on
Aug 9, 2025 • 0 new comments -
[Do not merge] Add out of place layernorm
#20197 commented on
Aug 7, 2025 • 0 new comments -
v1: Introduce LRU-based CPU offloading management
#20075 commented on
Aug 5, 2025 • 0 new comments -
Add support for token_type_ids
#19988 commented on
Aug 5, 2025 • 0 new comments -
[Core] Track expert selection metrics
#19915 commented on
Aug 4, 2025 • 0 new comments -
v1: Introduce an offloading component
#19848 commented on
Aug 4, 2025 • 0 new comments -
Triton-fused DeepseekScalingRotaryEmbedding
#19771 commented on
Aug 8, 2025 • 0 new comments -
Workaround for an integer overflow with large CHUNK_SIZE
#19770 commented on
Aug 8, 2025 • 0 new comments -
BLOCK_SIZE_K fix
#19769 commented on
Aug 8, 2025 • 0 new comments -
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 commented on
Aug 8, 2025 • 0 new comments -
[V1][SpecDecode]Support relaxed acceptance for thinking tokens in speculative decoding in V1
#21506 commented on
Aug 10, 2025 • 0 new comments -
[v1][spec decode] Run eagle with full cudagraph support
#21477 commented on
Aug 10, 2025 • 0 new comments -
v1/offloading: Add worker-side CPU support
#21448 commented on
Aug 6, 2025 • 0 new comments -
Updates to Flex + vLLM integration
#21416 commented on
Aug 8, 2025 • 0 new comments -
[wip] add nccl allocator and symm memory and enable TP all reduce for nccl symm
#21383 commented on
Aug 5, 2025 • 0 new comments -
[feat] Support EAGLE for Qwen2
#21363 commented on
Aug 4, 2025 • 0 new comments -
[Feature][EPLB] Add support for Qwen3 EPLB
#21290 commented on
Aug 5, 2025 • 0 new comments -
[Fix] correct tool_id for kimi-k2 when use tool_choice=required
#21259 commented on
Aug 7, 2025 • 0 new comments -
Make async scheduling compatible with DP
#21244 commented on
Aug 8, 2025 • 0 new comments -
ci: Add CUDA + arm64 release builds
#21201 commented on
Aug 6, 2025 • 0 new comments -
[Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel
#21197 commented on
Aug 4, 2025 • 0 new comments -
[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4
#21166 commented on
Aug 7, 2025 • 0 new comments -
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing
#21161 commented on
Aug 4, 2025 • 0 new comments -
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration
#21078 commented on
Aug 7, 2025 • 0 new comments -
Enable sequence parallelism for full cuda graph without specifying compile sizes
#21031 commented on
Aug 10, 2025 • 0 new comments -
[Meta] Unshift eagle prefill and support draft kv sharing from base
#21008 commented on
Aug 8, 2025 • 0 new comments -
fix: Handle unsupported message fields in tool calling
#20973 commented on
Aug 4, 2025 • 0 new comments -
Add add_logger API to AsyncLLM
#20952 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Large Data Parallel Size Cause Loading Safetensors Extremely Slow
#17783 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to reproduce the results of `vllm` using `transformers`
#21433 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: ERNIE-4.5 does not run on an RTX Pro 6000 Blackwell
#20712 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: When inferring Qwen3-32B-AWQ with vllm0.9.2, an error message appears: Quantization scheme is not supported
#20216 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: return graceful inference text input validation errors as part of output (without throwing an exception) - to enable skipping / handling bad examples after the processing of good ones
#16732 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: try to gracefully destroy process group in `vllm serve` on handling Ctrl+C (prior to processes termination)
#19196 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: FP8 model crashes with EngineDeadError and CUDA illegal memory access on H100 (CUDA 12.8)
#21466 commented on
Aug 5, 2025 • 0 new comments -
[Installation]: Docker image build fails for Apple Silicon using Dockerfile.cpu
#21714 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Aug 5, 2025 • 0 new comments -
[RFC] Run HF processing on GPU
#21995 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: GLM-4.5 not working
#22140 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Support EPLB for More MoE Models, e.g. Qwen 3, Llama 4
#20468 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: How to use vLLM to accelerate text classification with the Qwen3ForSequenceClassification model?
#19950 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model.
#21986 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: LoRA support for qwen2-vl Models
#11255 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: QTIP Quantization
#11416 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Automatic Prefix Cache life cycle
#12077 commented on
Aug 5, 2025 • 0 new comments -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: ModuleNotFoundError: No module named 'pyarrow" in main branch
#14487 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Segmentation Fault caused by model indexing errors (token sequence length exceeding 16384) in vLLM 0.7.3 multi-node deployment for DeepSeek R1 67B
#14652 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: v0.8.1 V1 with pipeline-parallel-size 4, weird responses
#16068 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Problems with vllm serve DeepSeek-R1 with 2 nodes and TP = 16 (includes vLLM v0.8.4, v0.7.3, v0.7.2; V0 and V1 engines)
#16692 commented on
Aug 5, 2025 • 0 new comments -
[Doc]: state requirements for testing or update to work for CPU-only
#16920 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Is it possible to use CUDA Graph during the encoding for encoder-decoder models?
#17789 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Self-deployed vLLM cannot call tools; enabling --enable-auto-tool-choice then prompts to configure --chat-template-content-format, and finally errors out
#17792 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to output metrics information from vllm?
#17795 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: how to return attention_weight logits in page_attention
#17796 commented on
Aug 6, 2025 • 0 new comments -
[Installation]: How to deploy docling model on vllm
#17807 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Disaggregated Prefill in vLLM 0.8.3 Produces Incorrect/Unreasonable Outputs
#17808 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Deploy EasyOCR , Docling models on vllm
#17814 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: vllm 0.8.5.dev468+g98834fefa.precompiled OOM on Qwen3-32B with 1 lora module
#17822 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Enabling EPLB leads to inconsistent inference results
#21606 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Dynamic loading LoRA is not working properly
#18372 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: tool call parameters doesn't respect schema for parameters when Streaming=True and tool_choice="auto"
#21756 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Usage of VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 in V1 likely to cause a crash
#17924 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: [P/D] Expose kv_transfer metrics (print to console, and to Prometheus)
#21784 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: The class UnquantizedLinearMethod must implement the 'embedding' method
#22111 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: When running phi-4-reasoning-plus with vLLM, the model gets stuck repeating reasoning phrases
#18141 commented on
Aug 5, 2025 • 0 new comments -
[RFC]: Optimize Input Media Processing in vLLM
#22044 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Tool call argument value of type `integer` may break things when `stream=True`
#21372 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Vllm whisper model response_format verbose_json not working
#14818 commented on
Aug 5, 2025 • 0 new comments -
[RFC]: vLLM-compile (minus cudagraphs) warm-start time should be close to zero
#20402 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#21661 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: swap_blocks and copy_blocks functions are wrong in flashinfer.py
#17362 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Expected there to be 4 prompt updates corresponding to 4 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs
#15338 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Facing run time error after building vllm cpu from source
#21935 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: ValueError: There is no module or parameter named 'lm_head' in Gemma3nForConditionalGeneration
#21755 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Single-Node EP Inference Failure on DeepSeek with PPLX/DeepGEMM Backend
#22039 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_0': 1}
#21882 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: An error occurred when using Eagle3 to load the Qwen3 series.
#22152 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Processor mismatch between what is provided by OpenGVLab and VLLM for InternVL leading to outputs of the processor being too large to be decoded for the tokenizer
#21899 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL + LoRA returns different results for same input on H20
#22057 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Multimodal Benchmarking Support (MMLM)
#21887 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: MoE models fail at startup: AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
#18967 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: vllm.LLM does not seem to re-initialize for distributed inference with subsequent models with Offline Inference
#9727 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Tensor-parallel offline inference errors with CalledProcessError: Command '['/usr/bin/gcc'....] returned non-zero exit status 1.
#15013 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made
#15483 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Using the latest version of the inference model, API calls report errors (V0.8.5)
#17430 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: failed to run LMCache example for v0
#17545 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: content is null when using "chat_template_kwargs": {"enable_thinking": false} in the request
#17609 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-7B gets stuck after loading weights and uses a lot of shared GPU memory
#17611 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: How to enable an LLM to simultaneously provide OpenAI API-compatible /v1/completions and /v1/embeddings services
#17627 commented on
Aug 4, 2025 • 0 new comments -
[Usage]: vLLM on multiple node GPUs
#17645 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Support for streaming N tokens at a time in AsyncLLMEngine
#17681 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: On A800 GPU with VLLM_USE_V1=1: ValueError: No available memory for the cache blocks
#17431 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: support HTTP/2.0?
#17695 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Required field "pixel_values" missing for Qwen2-VL
#17696 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Addition of pre-built AMD wheel packages
#17697 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: How to limit the thinking budget for reasoning mode
#17700 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: The v1 engine does not support `add_logger`.
#17702 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Qwen3-30B-A3B-FP8 fails to run on 2*3090
#17708 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Slight Embedding Precision Difference When Running bge-m3 in vLLM Compared to Original Model
#17713 commented on
Aug 5, 2025 • 0 new comments -
[RFC]: Enabling Arm Neoverse CI Runners
#17720 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Does vLLM allow 'dropping' requests instead of preempting them?
#17736 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Interrupting inference with ctrl-c causes future requests to hang
#17738 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: token_type_ids lost from prompt input during asynchronous request processing
#17743 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: vllm==0.10.0 + flashinfer, MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type'
#21822 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Mistral Small 3.2 doesn't work with images
#20025 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Engine Core initialization failed. See root cause above
#17618 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Audit and Update Examples To Use `VLLM_USE_V1=1`
#14530 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: [v1/core/block_pool.py] Assertion Failure: prev_block.block_hash is not None
#21992 commented on
Aug 4, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Aug 4, 2025 • 0 new comments -
[Usage]: Kubernetes Offline Model Usage
#22071 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Simple Data Parallelism in vLLM
#9206 commented on
Aug 4, 2025 • 0 new comments -
[RFC][Feature]: Unified Auto-Selection Mechanism for Attention Backends
#21805 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Verify that the `min_tokens` sampling parameter is working and covered by CI tests
#21950 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Error when loading EAGLE3 weights, yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
#19991 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Offline inference data parallel significantly slower in 0.8.2 than 0.6.4.post1 and 0.7.2
#17685 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Qwen3 30b a3b awq not working with vllm docker v0.8.5.post1
#17739 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vLLM breaks when sent a low-resolution picture
#17769 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: Inquiry About AMD APU Support (e.g., AMD AI Max+ 395) and Handling in vLLM
#17843 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Can't use GPTQ model with weight_zero_point
#17862 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Adding attention mask to vllm.attention.Attention
#17869 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: V1 on AMD MI300A complains that cupy is not present
#17875 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Error when running bge-m3 with vLLM
#17877 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: object has no attribute 'finished_req_ids'
#17881 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Model's test performance degrades significantly when deployed with vLLM at a concurrency exceeding 5
#17886 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Qwen/QwQ-32B crashes with "an illegal memory access was encountered", but SGLang works fine
#17889 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Regarding the CUDA error/no_thinking that occurred during pressure testing
#17893 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Tensor parallelism for GLM-4.5
#22126 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
#18455 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: v0.10.0 built with early version of pytorch that does not support sm-120
#21633 commented on
Aug 7, 2025 • 0 new comments -
[Feature] Skip modules for disabled modalities
#21943 commented on
Aug 7, 2025 • 0 new comments -
[RFC]: KV cache offloading
#19854 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: Potential Integer Overflow permute_cols.cu
#19450 commented on
Aug 7, 2025 • 0 new comments -
[Usage]: Can vllm skip MTP layer loading for GLM-4.5 to save some vram
#22120 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: disaggregated prefilling hangs when TP=2
#11247 commented on
Aug 7, 2025 • 0 new comments -
[Usage]: Triton compilation error (f16 to f16 conversion) on Tesla T4 with Qwen2.5-0.5B-Instruct and LoRA
#20259 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: 'NoneType' object has no attribute 'sampled_token_ids' for DP 2 PP 2
#22062 commented on
Aug 7, 2025 • 0 new comments -
[Usage]: When I use the Qwen3-32B with tool_choice='required' parameter, the tool calling gets stuck in a loop
#21026 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: [0.8.5v1+1P1D+LMCACHE] Is the Prefill instance running queue limited to processing only one request?
#18952 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Add Triton implementation of NVFP4 GEMM
#21014 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Fix MRL Support Detection for Qwen3-Embedding-8B Model (It Supports MRL per Latest Official Docs)
#20899 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Dynamic Chunked Pipeline Parallelism
#20808 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: TTFT increased especially in some Distill Models with small BatchSize in v0.10.0 compared to v0.9.2
#21983 commented on
Aug 8, 2025 • 0 new comments -
[Usage] Qwen3 Usage Guide
#17327 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Empty VllmConfig when calling `get_current_vllm_config`, causing VllmConfig `__post__init__` to fail
#21134 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Support structured output and tool call together
#16313 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vLLM Server Crash with CUDA Memory Error when serving `gemma-3-27b-it-FP8-Dynamic`
#21708 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vllm.third_party.pynvml.NVMLError_InvalidArgument: Invalid Argument
#19071 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: will whisper add language detection?
#14174 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: "transformers not installed" when using --guided-decoding-backend lm-format-enforcer
#14401 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Qwen2 MoE inference is super slow
#15470 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: CPU offload not working for vllm serve
#15877 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Is the V1 Engine ready for DeepSeek-V1/R1?
#16442 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: RuntimeError: operator _C::machete_gemm does not exist
#16810 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: DataParallel on multinode unable to start GPU
#16957 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: raise NotImplementedError
#17086 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Potential memory leak: VRAM continuously increases and not freed with deepseek-r1 on vLLM v1 engine
#17243 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: CUDA error with Qwen3-32B when processing larger token counts; the model becomes unresponsive / stability concerns
#17534 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Implement vAttention: Virtual Memory Management for KV Cache on NVIDIA GPUs
#17612 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vLLM hangs forever waiting for the engine process to start
#17676 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: min_tokens is not respected when stop is triggered early
#21987 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: [V1] Misleading Error Messages
#13510 commented on
Aug 6, 2025 • 0 new comments -
[Frontend] Combine microbatch tokenization with multi-modal processing
#21949 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Refactor tool parsers to eliminate coding errors and allow more efficient implementations.
#11522 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support Inflight quantization: load as 8bit quantization.
#11655 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Can AsyncLLMEngine support batch infer?
#14717 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Design flaws in the current tool parser.
#15177 commented on
Aug 6, 2025 • 0 new comments -
[New Model]: Support for SFR-Embedding-Code-2B_R embedding model
#15362 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: RequestMetrics object (accessed through output[0].metrics) is None
#15394 commented on
Aug 6, 2025 • 0 new comments -
[Performance]: Update Cascade Attention Heuristics for FA3
#15647 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: H20*TP16,can't start service, get error: Cannot allocate memory
#16142 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: vLLM still runs after Ray workers crash
#16259 commented on
Aug 6, 2025 • 0 new comments -
[Feature Request]: Support data_parallel_size in offline inference mode
#16588 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: [v1][Spec Dec] Specifying draft TP does not have any impact.
#17499 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Can't serve a Q4_K_M-GGUF model; can it be served?
#17661 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Offline multi-node inference
#17711 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support for OpenGVLab/InternVL3-38B-AWQ
#17734 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support quantization for pooling model which does embedding.
#17760 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to Truncate multi-modal tokens
#17765 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Logits processing with Lora is incorrect
#17766 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support for IBGDA
#17774 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Qwen3 Models GGUF Support
#21511 commented on
Aug 7, 2025 • 0 new comments -
[Feature]: Attention-FFN disaggregation
#21644 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: AWQ fails on MoE models
#22004 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: Failing to initialize engine on qwen3 on B200 with VLLM_USE_DEEP_GEMM=1
#21542 commented on
Aug 7, 2025 • 0 new comments -
invalid conversion from ‘int’ to ‘CUresult’ {aka ‘cudaError_enum’}
#17931 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: LLaMa 3.1 8B/70B/405B all behave poorly and differently using completions API as compared to good chat API
#7382 commented on
Aug 7, 2025 • 0 new comments -
[RFC]: Prompt Embeddings Support in v1 Engine
#22124 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: some error when training in ray
#20431 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: can not support InternVL3-78B-AWQ
#21695 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: GLM-4.1V lora trained model reports target_module mismatch error
#22077 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: OpenReasoning-Nemotron-32B Only Outputs Exclamation Marks Regardless of Input
#21292 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Cannot use `uv run` or `uv run python` on Mac series
#15985 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Any plans to support TokenWeave optimizations in vLLM?
#20223 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#20226 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: AttributeError: 'OvisConfig' object has no attribute 'num_attention_heads'
#17646 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to implement the inference test of LLM model PD (Prefill-Decode) disaggregation using the vllm framework ?
#21800 commented on
Aug 6, 2025 • 0 new comments -
Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
#2248 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: CUDA version in `vllm/vllm-openai:latest` is older than the k8s node's CUDA 12.9, causing an incompatibility error
#21979 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: DeepSeek R1 on an 8xH200 node is too slow
#17035 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: all2all communication hangs when using DeepEP and PPLX for v0.9.2
#21306 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Unification of frontend parser
#17817 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: VLLM 0.10.0 breaks quantized models batch inference speed for Qwen2.5-VL-7B (tested multiple quantization types)
#21689 commented on
Aug 6, 2025 • 0 new comments