Insights: vllm-project/vllm
Overview
192 Pull requests merged by 80 people
-
Fix TensorSchema validation test for symbolic dims
#22366 merged
Aug 10, 2025 -
Remove redundant row_indices unsqueeze operation in MiniCPMO
#22528 merged
Aug 10, 2025 -
Migrate LlavaNextImageInputs to TensorSchema
#21774 merged
Aug 10, 2025 -
Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion Benchmarks
#22534 merged
Aug 10, 2025 -
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel
#22593 merged
Aug 10, 2025 -
[doc] add alibaba cloud as sponsor
#22597 merged
Aug 10, 2025 -
[doc] add beijing meetup links
#22596 merged
Aug 10, 2025 -
Move CacheConfig from config/__init__.py to config/cache.py
#22586 merged
Aug 10, 2025 -
[Misc] Replace flaky image urls in pixtral test
#22574 merged
Aug 10, 2025 -
[Docs] Fix warnings in docs build
#22588 merged
Aug 10, 2025 -
[Misc] Further refine type annotations in parallel state
#22499 merged
Aug 10, 2025 -
[Doc] Fix API doc link in side navigation
#22585 merged
Aug 10, 2025 -
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_config
#22566 merged
Aug 10, 2025 -
[Minor] Fix pre-commit error on main
#22579 merged
Aug 10, 2025 -
Refactor sliding window configuration to Transformers best practice
#21927 merged
Aug 10, 2025 -
[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block
#22394 merged
Aug 10, 2025 -
Improve fast_topk function with type hints and documentation
#22530 merged
Aug 10, 2025 -
[Config] add "qwen" as a native eagle3 target supported model
#22333 merged
Aug 10, 2025 -
[oss] Init gpt-oss bf16 support
#22508 merged
Aug 10, 2025 -
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers
#21401 merged
Aug 10, 2025 -
[FEAT] [Performance] Add triton mrope to replace the torch code path
#22375 merged
Aug 9, 2025 -
[Bugfix] Fix basic models tests hanging due to mm processor creation
#22571 merged
Aug 9, 2025 -
[Model] Gemma3n MM
#20495 merged
Aug 9, 2025 -
Move ParallelConfig from config/__init__.py to config/parallel.py
#22565 merged
Aug 9, 2025 -
[Docs] Reduce noise in docs and --help from the JSON tip
#22567 merged
Aug 9, 2025 -
[CI] [Hybrid] Speed up hybrid models test by removing large models
#22563 merged
Aug 9, 2025 -
GLM-4.5V with new class name at transformers
#22520 merged
Aug 9, 2025 -
Update docs for Minimax-Text support
#22562 merged
Aug 9, 2025 -
[Bugfix] Fix CI moe kernel failure
#22556 merged
Aug 9, 2025 -
[Bugfix] Fix failing GPT-OSS initialization test
#22557 merged
Aug 9, 2025 -
[ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel
#22097 merged
Aug 9, 2025 -
[TPU] Add support for online w8a8 quantization
#22425 merged
Aug 9, 2025 -
Fix loading of quantized BigCode models
#22463 merged
Aug 9, 2025 -
[Misc] Use config definitions from Transformers library
#21913 merged
Aug 9, 2025 -
v1: Pass KVConnectorOutput to scheduler-side
#22157 merged
Aug 9, 2025 -
[V1] [Hybrid] Support Minimax-Text-01 in V1
#22151 merged
Aug 9, 2025 -
[Log] Add Warning for Deprecation of DeepGEMM old version
#22194 merged
Aug 9, 2025 -
Remove mamba_ssm from vLLM requirements; install inside test container using --no-build-isolation
#22541 merged
Aug 9, 2025 -
[Doc] Add usage of implicit text-only mode
#22561 merged
Aug 9, 2025 -
Implicit language-model-only mode via limit-mm-per-prompt
#22299 merged
Aug 9, 2025 -
[Bugfix] Fix ModernBert cuda graph capturing in v1
#21901 merged
Aug 9, 2025 -
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D
#22317 merged
Aug 9, 2025 -
[XPU] upgrade torch 2.8 on for XPU
#22300 merged
Aug 9, 2025 -
Drop flaky test_healthcheck_response_time
#22539 merged
Aug 8, 2025 -
Extract CompilationConfig from config.py
#22524 merged
Aug 8, 2025 -
[Frontend] Add unix domain socket support
#18097 merged
Aug 8, 2025 -
[Docs] fix broken links in metrics.md
#22315 merged
Aug 8, 2025 -
Skip Qwen 1 in CI because remote code is no longer compatible with Transformers
#22536 merged
Aug 8, 2025 -
[Bugfix] Update FA commit hash
#22546 merged
Aug 8, 2025 -
[Misc] DeepGEMM : Avoid JIT generation in the hot-path
#22215 merged
Aug 8, 2025 -
[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA
#21691 merged
Aug 8, 2025 -
[gpt-oss] Support tool call and implement MCP tool server
#22427 merged
Aug 8, 2025 -
[Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling”
#22466 merged
Aug 8, 2025 -
[gpt-oss] guard import when triton kernel is not installed
#22529 merged
Aug 8, 2025 -
[Benchmark] Add benchmark tool for multi turn conversations
#20267 merged
Aug 8, 2025 -
[gpt-oss] triton kernel mxfp4
#22421 merged
Aug 8, 2025 -
Remove exception for Python 3.8 typing from linter
#22506 merged
Aug 8, 2025 -
[Docs] Improve API docs (+small tweaks)
#22459 merged
Aug 8, 2025 -
[BugFix] Don't cancel asyncio tasks directly from destructors
#22476 merged
Aug 8, 2025 -
[Misc] fix openai version
#22485 merged
Aug 8, 2025 -
[Misc] Begin deprecation of get_tensor_model_*_group
#22494 merged
Aug 8, 2025 -
[CI/Build] Fix multimodal tests
#22491 merged
Aug 8, 2025 -
[bench] Fix benchmark/serve.py to ignore unavailable results
#22382 merged
Aug 8, 2025 -
[Doc] Sleep mode documentation
#22310 merged
Aug 8, 2025 -
[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10
#22426 merged
Aug 8, 2025 -
Fix pre-commit
#22487 merged
Aug 8, 2025 -
Optimize MiniCPMO mask creation with vectorized implementation
#22464 merged
Aug 8, 2025 -
not tie_word_embeddings for glm-4.5 and glm-4.5v
#22460 merged
Aug 8, 2025 -
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match
#22065 merged
Aug 8, 2025 -
[Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000)
#22131 merged
Aug 8, 2025 -
Fix Flashinfer CUTLASS MOE Allgather
#21963 merged
Aug 8, 2025 -
Support Tensorrt-LLM MoE fp4 for low-latency
#21331 merged
Aug 8, 2025 -
Add ModelOpt Qwen3 nvfp4 support
#20101 merged
Aug 8, 2025 -
[PERF] Use pybase64 to more quickly decode prompt embeddings
#22469 merged
Aug 8, 2025 -
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine
#21496 merged
Aug 8, 2025 -
[Misc] normalize multiprocessing Queue usage
#22371 merged
Aug 8, 2025 -
Remove from_dict from SpeculativeConfig
#22451 merged
Aug 7, 2025 -
[Frontend] Use engine argument to control MM cache size
#22441 merged
Aug 7, 2025 -
[Core] Simplify mm processing cache
#22457 merged
Aug 7, 2025 -
Fix pre-commit error in main
#22462 merged
Aug 7, 2025 -
[gpt-oss] Generate ResponseOutputItem from Harmony Message
#22410 merged
Aug 7, 2025 -
[Tool] Fix auto tool call
#22434 merged
Aug 7, 2025 -
[Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM
#22352 merged
Aug 7, 2025 -
[Core] Store only the keys for multi-modal data in P0
#22198 merged
Aug 7, 2025 -
[Docs] Update features/disagg_prefill, add v1 examples and development
#22165 merged
Aug 7, 2025 -
[Doc] update docs for nightly benchmarks
#12022 merged
Aug 7, 2025 -
[Docs] Factor out troubleshooting to its own guide; add section for Ray Observability
#21578 merged
Aug 7, 2025 -
[Doc] Fix link to prefix caching design
#22384 merged
Aug 7, 2025 -
[Misc] Enhance code formatting in mxfp4.py
#22423 merged
Aug 7, 2025 -
Add H20-3e fused MoE kernel tuning configs for GLM-4.5
#22433 merged
Aug 7, 2025 -
[Docs] Add missing dependency for docs build
#22435 merged
Aug 7, 2025 -
feat: Add --enable-log-outputs flag for logging model generations
#20707 merged
Aug 7, 2025 -
[Misc] Support routing logic simulation
#21990 merged
Aug 7, 2025 -
[Frontend] Update OpenAI error response to upstream format
#22099 merged
Aug 7, 2025 -
[Model] Switch to Fused RMS norm in Qwen2.5_VL model.
#22184 merged
Aug 7, 2025 -
[Bench] Split serve.py:main into async/async versions
#22405 merged
Aug 7, 2025 -
[CI] Skip the pooling models that do not support transformers v4.55
#22411 merged
Aug 7, 2025 -
[Bugfix] EPLB load statistics problem
#22167 merged
Aug 7, 2025 -
[gpt-oss] Convert user input to harmony format
#22402 merged
Aug 7, 2025 -
preload heavy modules when mp method is forkserver
#22214 merged
Aug 7, 2025 -
Optimize logger init performance by using module-level constants
#22373 merged
Aug 7, 2025 -
Update hf_xet pin to resolve hangs
#22356 merged
Aug 7, 2025 -
[Bugfix] Add proper comparison for package versions
#22314 merged
Aug 7, 2025 -
[Bugfix]: Fix the streaming output for function calls in the minimax
#22015 merged
Aug 7, 2025 -
Use float32 for test_completion.py
#22385 merged
Aug 7, 2025 -
[Bugfix] Fix wrong method name in Intern-S1 image processor
#22417 merged
Aug 7, 2025 -
[Qwen3] Enable dual-chunk-attention support for Qwen3 models.
#21924 merged
Aug 7, 2025 -
[XPU] Fix flash_attn_varlen_func interface on xpu
#22350 merged
Aug 7, 2025 -
Support encoder_only attention for FlexAttention
#22273 merged
Aug 7, 2025 -
[model] Support MiniCPM-V 4.0
#22166 merged
Aug 7, 2025 -
Update flashinfer-python==0.2.10
#22389 merged
Aug 7, 2025 -
Fix trtllm-gen attention env and add attention sink
#22378 merged
Aug 7, 2025 -
[gpt-oss] fix model config with hf_config
#22401 merged
Aug 7, 2025 -
[gpt-oss] add demo tool server
#22393 merged
Aug 7, 2025 -
[Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue
#22399 merged
Aug 7, 2025 -
[v1] - Mamba1 Attention Metadata
#21249 merged
Aug 7, 2025 -
[gpt-oss] flashinfer mxfp4
#22339 merged
Aug 6, 2025 -
[gpt-oss] attention sink init fix gemini
#22335 merged
Aug 6, 2025 -
[gpt-oss] Add loop for built-in tool call
#22374 merged
Aug 6, 2025 -
[Bugfix] Make condition in triton kernel constexpr
#22370 merged
Aug 6, 2025 -
[BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks
#22368 merged
Aug 6, 2025 -
add the codes to check AMD Instinct GPU number
#22367 merged
Aug 6, 2025 -
[BugFix] Fix FA2 RuntimeError when sinks is provided
#22365 merged
Aug 6, 2025 -
[Minor] Fix type
#22347 merged
Aug 6, 2025 -
[gpt-oss] Support chat completion api
#22342 merged
Aug 6, 2025 -
[gpt-oss] add model to supported models doc
#22336 merged
Aug 6, 2025 -
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils
#22340 merged
Aug 6, 2025 -
[Misc] Clean up duplicated hf overrides
#22311 merged
Aug 6, 2025 -
[gpt-oss] Add openai-harmony as default dependency
#22332 merged
Aug 6, 2025 -
[gpt-oss] flashinfer attention sink init
#22330 merged
Aug 6, 2025 -
[GptOss] Add GptOss reasoning parser to support structure output
#22322 merged
Aug 6, 2025 -
[ROCm] Add attention sink to use_rocm_custom_paged_attention
#22329 merged
Aug 6, 2025 -
Add GPT-OSS model code and config [1/N]
#22327 merged
Aug 6, 2025 -
Update transformers to v4.55
#21931 merged
Aug 6, 2025 -
Add attention sink in attention backends
#22320 merged
Aug 6, 2025 -
Increase openai-python version
#22316 merged
Aug 6, 2025 -
Upgrade FA3 for attention sink
#22313 merged
Aug 6, 2025 -
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm
#22264 merged
Aug 6, 2025 -
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation
#22275 merged
Aug 6, 2025 -
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding
#21862 merged
Aug 6, 2025 -
[Bugfix] Fix MoE BNB version
#22260 merged
Aug 6, 2025 -
[Bugfix] Fix 3D input passed into cutlass_scaled_mm
#22278 merged
Aug 6, 2025 -
[Bugfix] Remove faulty test for oot attention backend
#22286 merged
Aug 6, 2025 -
[CI][TPU] Fix docker clean up
#22271 merged
Aug 5, 2025 -
[bugfix] fix blackwell deepep installation
#22255 merged
Aug 5, 2025 -
[V1] port xformers backend to v1
#21342 merged
Aug 5, 2025 -
[Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING
#22199 merged
Aug 5, 2025 -
[CI/Build] Update flashinfer to 0.2.9
#22233 merged
Aug 5, 2025 -
Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail
#22128 merged
Aug 5, 2025 -
[V0 Deprecation][TPU] Remove V1 flag check from tests
#22248 merged
Aug 5, 2025 -
[Misc] correct static type check for GroupCoordinator
#21946 merged
Aug 5, 2025 -
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel
#22095 merged
Aug 5, 2025 -
[Feature] Non-contiguous Support for FP8 Quantization
#21961 merged
Aug 5, 2025 -
Migrate KimiVLImagePixelInputs to TensorSchema
#21769 merged
Aug 5, 2025 -
[Docs][TPU] Highlight TPU Software version selection
#22242 merged
Aug 5, 2025 -
[Model] Pooling model activation supports per request control by PoolingParams
#20538 merged
Aug 5, 2025 -
[Core] Factor out common logic for MM budget calculation
#22228 merged
Aug 5, 2025 -
[UX] Fail if an invalid attention backend is specified
#22217 merged
Aug 5, 2025 -
[Bugfix] Misaligned params in TreeAttentionImpl
#22226 merged
Aug 5, 2025 -
Optimize configuration access with LRU cache in custom ops
#22204 merged
Aug 5, 2025 -
[Misc] log more detailed message for ensure_model_parallel_initialized
#22144 merged
Aug 5, 2025 -
[Doc] add backend to doc string of initialize_model_parallel
#22142 merged
Aug 5, 2025 -
[Misc] Remove pass_config from CompilationConfig dump_json excluded
#21911 merged
Aug 5, 2025 -
fix: kimi_k2 return empty tool call list
#22149 merged
Aug 5, 2025 -
[Log] DeepGEMM Update Log for Unaligned Problem Size
#22208 merged
Aug 5, 2025 -
self.gate dtype update for GLM-4.5
#22203 merged
Aug 5, 2025 -
[ROCm][Bugfix] Compilation passes fix
#22202 merged
Aug 5, 2025 -
[FEAT] Refactor ROPE into module
#22192 merged
Aug 5, 2025 -
[V0 deprecation][P/D] Deprecate v0 KVConnectorBase code (1/2)
#21785 merged
Aug 5, 2025 -
[V1] reduce block size for tree attention correctness test to fix 'ou…
#22207 merged
Aug 5, 2025 -
Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling."
#22223 merged
Aug 5, 2025 -
[Bugfix] V1 Fix the cursor leakage issue during request scheduling.
#21173 merged
Aug 5, 2025 -
[NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading
#22073 merged
Aug 5, 2025 -
[Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2pNcclConnector
#21819 merged
Aug 4, 2025 -
[Bug] Update auto_tune.sh to separate benchmarking and profiling.
#21629 merged
Aug 4, 2025 -
[Responses API] Ignore store=True and process the request by default
#22185 merged
Aug 4, 2025 -
Fix Arcee model weight loading: Add custom load_weights
#21725 merged
Aug 4, 2025 -
[Doc] Update pooling model docs
#22186 merged
Aug 4, 2025 -
[Sampler] Support returning all logprobs or logits
#21792 merged
Aug 4, 2025 -
[Bugfix] Fix failing GGUF models test
#22174 merged
Aug 4, 2025 -
[feat] move WEIGHT_SCALE_SUPPORTED into raise block to accelerate RLHF weight loading
#21164 merged
Aug 4, 2025 -
[Misc] Modify the organization of GLM series
#22171 merged
Aug 4, 2025 -
[CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes
#22163 merged
Aug 4, 2025 -
Remove index_put from MM embeddings merging
#22105 merged
Aug 4, 2025 -
[refactor] improve ConstantList exception specificity
#22156 merged
Aug 4, 2025 -
Add tree attention backend for v1 (part 1)
#20401 merged
Aug 4, 2025 -
remove duplicate code within cleanup_dist_env_and_memory
#22147 merged
Aug 4, 2025 -
[PD] add test for chat completions endpoint
#21925 merged
Aug 4, 2025 -
[RLHF] Fix torch.dtype not serializable in example
#22158 merged
Aug 4, 2025 -
[fix] fix correct assertion syntax error in attention utils.
#22154 merged
Aug 4, 2025 -
Use aiohttp connection pool for benchmarking
#21981 merged
Aug 4, 2025
123 Pull requests opened by 97 people
-
[Misc] add replicaid to ray metrics
#22159 opened
Aug 4, 2025 -
[Misc] Minor fixes and cleanups for elastic EP
#22160 opened
Aug 4, 2025 -
[Bugfix] Support full cuda graph with sliding window attention
#22168 opened
Aug 4, 2025 -
[Model][V1] Support Ernie MTP
#22169 opened
Aug 4, 2025 -
[Bugfix] Fix erroneous randomly generated cases in bad word testing
#22170 opened
Aug 4, 2025 -
Enable multi-stream for Llama4 q/k_norm and MoE
#22175 opened
Aug 4, 2025 -
[Performance] EPLB Execution Optimization
#22179 opened
Aug 4, 2025 -
enable dp for custom devices on ray executor.
#22181 opened
Aug 4, 2025 -
[P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy
#22188 opened
Aug 4, 2025 -
[Model] Mamba models - Support FP32 SSM cache
#22196 opened
Aug 4, 2025 -
[Misc] Update HunYuan dense model test
#22200 opened
Aug 4, 2025 -
[Misc] Improve Worker process title
#22205 opened
Aug 4, 2025 -
[Bugfix] fix hash error for chunked local attention hybrid KV
#22209 opened
Aug 4, 2025 -
[Perf] Support topk softmax fused kernel for broader num_experts
#22211 opened
Aug 4, 2025 -
feat: Add native support for XLM-RoBERTa embedding and BAAI/bge-reranker-v2-m3
#22216 opened
Aug 4, 2025 -
[Frontend] Added parallel_tool_calls option on the openai API with Guided Decoding
#22218 opened
Aug 4, 2025 -
[Quant] Refactor CompressedTensorsConfig
#22219 opened
Aug 4, 2025 -
Fp8 paged attention update
#22222 opened
Aug 5, 2025 -
[Bugfix] Disable the statslogger if the api_server_count is greater than 1
#22227 opened
Aug 5, 2025 -
[Perf][Feat][Core] Workload-Aware KVCache Eviction Policy
#22236 opened
Aug 5, 2025 -
fix(worker): adjust memory requirement calculation for GPU worker
#22237 opened
Aug 5, 2025 -
[Platform] allow platform to init dp group
#22243 opened
Aug 5, 2025 -
[Misc] benchmark_moe supports expert parallel
#22251 opened
Aug 5, 2025 -
[BugFix] Fix port lookup in internal DP LB tests
#22252 opened
Aug 5, 2025 -
[TPU][Misc] Fix TPU.device_name
#22254 opened
Aug 5, 2025 -
[Misc] Use comma-separated string for --kernels to avoid greedy consumption of the subparser command
#22258 opened
Aug 5, 2025 -
Support gpt-oss
#22259 opened
Aug 5, 2025 -
[Fix] apply_temperature() causing `inf` logits
#22261 opened
Aug 5, 2025 -
[V1][Spec Decode] Async scheduling integration with spec decode
#22262 opened
Aug 5, 2025 -
Il tool compare tool
#22263 opened
Aug 5, 2025 -
Support conditional torch.compile per module
#22269 opened
Aug 5, 2025 -
`NixlConnector` Support HTTP/S metadata exchange instead of zmq
#22274 opened
Aug 5, 2025 -
[Frontend] Add chunked processing to handle long inputs in embedding models
#22280 opened
Aug 5, 2025 -
Only convert output to weakref for last graph across all compilation units
#22282 opened
Aug 5, 2025 -
[Core] Return final response for aborted requests from `AsyncLLM.generate`
#22283 opened
Aug 5, 2025 -
[Feat] Allow custom `comm_group` in ParallelLinear layers
#22309 opened
Aug 6, 2025 -
Removing redundant installation of torch for CPU build
#22318 opened
Aug 6, 2025 -
[V1] Test cases for parallel sampling with all output_kind options
#22321 opened
Aug 6, 2025 -
[Not Merged] gpt-oss rc1 rebased
#22328 opened
Aug 6, 2025 -
fix(cuda): add arch 8.9 to support NVIDIA L4 GPU
#22343 opened
Aug 6, 2025 -
[Model]Force use triton compressed_tensor_moe instead of cutlass
#22345 opened
Aug 6, 2025 -
[Kernel] Add nvfp4 gemm flashinfer backends
#22346 opened
Aug 6, 2025 -
[Model] NemotronH Support
#22349 opened
Aug 6, 2025 -
[Bugfix] Simulate mxfp4 quark model execution on cdna4 until kernels are integrated
#22355 opened
Aug 6, 2025 -
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend
#22357 opened
Aug 6, 2025 -
[Bugfix] fixed the fp8liner init
#22360 opened
Aug 6, 2025 -
Improved defaulting of chunked prefill and prefix caching in V1
#22362 opened
Aug 6, 2025 -
Layered Dockerfile for smaller size and faster image pulling
#22377 opened
Aug 6, 2025 -
[gpt-oss] tool parser supports for /chat/completions [1/n]
#22386 opened
Aug 6, 2025 -
[Sampler] Support returning final logprobs
#22387 opened
Aug 6, 2025 -
N gram
#22390 opened
Aug 6, 2025 -
Ability to use custom-all-reduce on systems with more than 2 PCIe GPUs via env var
#22392 opened
Aug 6, 2025 -
[BugFix] Don't match ScaledMM patterns unless torch._scaled_mm is available
#22400 opened
Aug 6, 2025 -
[Metrics] Expose num_requests_waiting_by_priority metric
#22404 opened
Aug 6, 2025 -
[feature] add all_reduce registry for custom backends
#22406 opened
Aug 6, 2025 -
[XPU] support ray distribute executor on XPU
#22413 opened
Aug 7, 2025 -
WIP - Hack xpu memory - my attempt at fixing issue #20743
#22415 opened
Aug 7, 2025 -
[Misc] Add pre-commit __init__.py check
#22418 opened
Aug 7, 2025 -
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel
#22428 opened
Aug 7, 2025 -
[Bugfix] Bypass `is_kv_cache_type_uniform check` when kvcache is disabled
#22429 opened
Aug 7, 2025 -
[gpt-oss] Support streaming in response API
#22431 opened
Aug 7, 2025 -
[XPU][P/D] Add XPU support in NixlConnector
#22436 opened
Aug 7, 2025 -
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP
#22437 opened
Aug 7, 2025 -
[Speculators][Speculative Decoding] Add Eagle3 support for Qwen2
#22438 opened
Aug 7, 2025 -
Fixed vllm build and runtime openai tests on ppc64le
#22443 opened
Aug 7, 2025 -
[Model] use autoWeightsLoader for gptoss
#22446 opened
Aug 7, 2025 -
support silu+nvfp4 quant fusion
#22448 opened
Aug 7, 2025 -
[Bugfix] Added more env vars to hash
#22449 opened
Aug 7, 2025 -
Fix nvfp4 swizzling
#22450 opened
Aug 7, 2025 -
[V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` implementations
#22456 opened
Aug 7, 2025 -
[Feature] add procese set cpu affinity current gpu device
#22461 opened
Aug 7, 2025 -
[Quantization]: Support compressed-tensors mixed-precision model loading
#22468 opened
Aug 7, 2025 -
vllm fix check on max vocab size
#22471 opened
Aug 7, 2025 -
[V1][P/D]Bug fix: handle edge case where KVConnectorOutput is None
#22473 opened
Aug 7, 2025 -
[Refactor] Refactor FP8 & INT8 Quant Folder inside `8bit`
#22474 opened
Aug 7, 2025 -
[Attention] FA3 Attention Sinks Perf Boost
#22478 opened
Aug 8, 2025 -
[Structured Output] Make the output of structured output example more complete
#22481 opened
Aug 8, 2025 -
[Transform] [Quantization] Add transforms to compressed tensors
#22486 opened
Aug 8, 2025 -
Feat/sliding window metrics — Related to #22480
#22488 opened
Aug 8, 2025 -
consistency between the test and final Docker image
#22490 opened
Aug 8, 2025 -
[CI] Add end-to-end V1 min_tokens test coverage
#22495 opened
Aug 8, 2025 -
[Debugging] Add annotation for easier trace analysis
#22496 opened
Aug 8, 2025 -
[XPU] Fix OOM issue for data parallel with Ray backend
#22500 opened
Aug 8, 2025 -
[Platform] Custom ops update
#22509 opened
Aug 8, 2025 -
Fix Llama4 FlashInfer FP4 MoE issues
#22511 opened
Aug 8, 2025 -
[gpt-oss] Small bug fixes for frontend
#22512 opened
Aug 8, 2025 -
[WIP][Model] Add Ernie4.5 VL Model Support
#22514 opened
Aug 8, 2025 -
code clean when clean shm
#22516 opened
Aug 8, 2025 -
[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module.
#22521 opened
Aug 8, 2025 -
[WIP] [Bench] Add Triton NVFP4 GEMM
#22523 opened
Aug 8, 2025 -
[Fix] fix offline env use local mode path
#22526 opened
Aug 8, 2025 -
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs
#22527 opened
Aug 8, 2025 -
Fix torch version check for mxfp4
#22535 opened
Aug 8, 2025 -
New moe quant config
#22537 opened
Aug 8, 2025 -
[Kernel] Add cuda kernel for gpt_oss activation
#22538 opened
Aug 8, 2025 -
add tg-mxfp4-moe-test
#22540 opened
Aug 8, 2025 -
[WIP, Do not review] DI
#22542 opened
Aug 8, 2025 -
[Bugfix][V1] Fix Finished Request Handling in Async Scheduling
#22543 opened
Aug 8, 2025 -
[V1][Hybrid][Backend] Allow different data types in Hybrid Cache Manager
#22544 opened
Aug 8, 2025 -
[Misc] fail fast when exception is raised in in_the_same_node_as
#22553 opened
Aug 8, 2025 -
[gpt-oss] Add test for response API + harmony (but skipped)
#22554 opened
Aug 9, 2025 -
[BugFix] EAGLE Load Bias From Config
#22558 opened
Aug 9, 2025 -
Frontend: Adding LM Format Enforcer support to V1 engine
#22564 opened
Aug 9, 2025 -
[Core] Use individual MM items in P0/P1 cache and model runner
#22570 opened
Aug 9, 2025 -
optimize: improve scheduler policy lookup performance
#22573 opened
Aug 9, 2025 -
[Core][BugFix] Fix thread safety issue in RequestOutputCollector
#22576 opened
Aug 9, 2025 -
Fix Ray placement group allocation is not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional gpu)
#22577 opened
Aug 10, 2025 -
[CI/Build] Fix tensorizer test for load_format change
#22583 opened
Aug 10, 2025 -
[Misc][gpt-oss] guard import when triton kernel when not up to date
#22584 opened
Aug 10, 2025 -
Add return_token_ids_alongside parameter to OpenAI API endpoints
#22587 opened
Aug 10, 2025 -
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models
#22589 opened
Aug 10, 2025 -
[BugFix] Fix logits repetition penalty cuda check
#22592 opened
Aug 10, 2025 -
[V1] [Hybrid] Enable Full CUDA graph by default for models with mamba2 layers in V1
#22594 opened
Aug 10, 2025 -
v1: Offloading connector
#22595 opened
Aug 10, 2025 -
[BugFix] Fix KVConnectorOutput TPU breakage
#22598 opened
Aug 10, 2025 -
[Feature] Improve logging for error messages
#22599 opened
Aug 10, 2025 -
[Misc][gpt-oss] Add rules to label gpt-oss related PRs
#22600 opened
Aug 10, 2025 -
[Docs] Add comprehensive CLI reference for all large `vllm` subcommands
#22601 opened
Aug 10, 2025 -
Vectorize RMSNorm CUDA kernel
#22602 opened
Aug 10, 2025 -
minor: zero workspace buffer init for flashinfer trtllm-gen attn
#22603 opened
Aug 10, 2025
172 Issues closed by 37 people
-
[Feature]: Serving OpenAI model in bf16
#22277 closed
Aug 10, 2025 -
[Doc]:
#22582 closed
Aug 10, 2025 -
[Docs] Feedback for `/en/latest/examples/offline_inference/async_llm_streaming.html`
#22581 closed
Aug 10, 2025 -
[Docs] Feedback for `/en/latest/examples/offline_inference/async_llm_streaming.html`
#22580 closed
Aug 10, 2025 -
[Bug]: No output / Repeated outputs when using Gemma 3 on vLLM
#20341 closed
Aug 10, 2025 -
AttributeError: 'Gemma3TextConfig' object has no attribute 'interleaved_sliding_window'
#22270 closed
Aug 10, 2025 -
[Bug]: Major issues with transformers version causing rubbish generations with Gemma3 family using vllm
#22475 closed
Aug 10, 2025 -
[Bug]: Gemma-3-12B-it model getting stuck in repetitive output loops
#15752 closed
Aug 10, 2025 -
[Bug]: gemma3 shows degraded accuracy in vLLM v0.8.4
#17689 closed
Aug 10, 2025 -
[Bug]: Worker VllmWorkerProcess pid 000000 died, exit code: -15
#15295 closed
Aug 10, 2025 -
[Bug]: Stuck When Launching Llama-4-Maverick-17B-128E-Instruct-FP8
#16152 closed
Aug 10, 2025 -
[Usage]: Multiple Models on Same Port
#16232 closed
Aug 10, 2025 -
[Performance]: FP8 does not demonstrate an inference speed superior to that of FP16
#16261 closed
Aug 10, 2025 -
[Performance]: H100 Optimisation Configuration For Offline Inferencing
#16265 closed
Aug 10, 2025 -
[Usage]: how to redirect save logs to local file.
#16319 closed
Aug 10, 2025 -
[Bug]: Used VRAM is less than GPU memory utilization on a heterogeneous setup
#16382 closed
Aug 10, 2025 -
[Usage]: --kv-transfer-config is not supported by the V1 Engine. Falling back to V0.
#16395 closed
Aug 10, 2025 -
[Bug]: corrupted double-linked list (not small) Aborted
#16412 closed
Aug 10, 2025 -
[Feature]: Refactor config.py for non-cuda backend
#16468 closed
Aug 10, 2025 -
[Usage]: How to implement Streaming Input
#16469 closed
Aug 10, 2025 -
[Usage]: failed to run PD LMCache example code
#16471 closed
Aug 10, 2025 -
[Bug]: Medusa speculation hangs when tp > 1
#16477 closed
Aug 10, 2025 -
[Usage]: Inferencing with DeepSeek R1
#16498 closed
Aug 10, 2025 -
[New Model]: Gemma 3n support
#18476 closed
Aug 9, 2025 -
Generate nothing from VLLM output
#1185 closed
Aug 9, 2025 -
[Bug]: LLMEngine.add_request can't handle erroneous type of request_id
#19588 closed
Aug 9, 2025 -
[CI Failure]: test_hybrid.py::test_models undefined symbol: _ZN3c104cuda9SetDeviceEab
#22395 closed
Aug 9, 2025 -
[Usage]: vLLM support for FP8 models (QWEN3 FP8) on RTX 50 series / SM120
#21648 closed
Aug 9, 2025 -
[Bug]: ValueError:Could not broadcast input array from shape (542,) into shape (512,)
#9963 closed
Aug 9, 2025 -
[Usage]: `Phi-4-multimodal-instruct` activate LoRA module but get mangled text output
#15440 closed
Aug 9, 2025 -
[Bug]: Unable to run Phi4 with tensor-parallel-size 4 torch.compile compatibility
#16021 closed
Aug 9, 2025 -
[RFC]: vllm 0.10.1 gpt-oss version, will speculative sampling be used by default?
#22549 closed
Aug 8, 2025 -
[Feature]: support binding on Unix Domain Sockets (UDS)
#13907 closed
Aug 8, 2025 -
[RFC]: vllm 0.10.1 gpt-oss version, will speculative sampling be used by default?
#22547 closed
Aug 8, 2025 -
[Usage]:
#22552 closed
Aug 8, 2025 -
[RFC]:
#22548 closed
Aug 8, 2025 -
[RFC]: vllm 0.10.1 gpt-oss, will speculative sampling be used?
#22550 closed
Aug 8, 2025 -
[Bug]: Dependencies versions not pinned in tools/ep_kernels/install_python_libraries.sh
#22467 closed
Aug 8, 2025 -
[CI Failure]: V1 Test
#22172 closed
Aug 8, 2025 -
[Bug]: Smollm3M not working anymore
#22517 closed
Aug 8, 2025 -
[Usage]: How to use vLLM for cross-modal offline inference
#19429 closed
Aug 8, 2025 -
[Bug]: vLLM serve `google/gemma-3-1b-it` with version `0.8.5` interrupted `SIGTERM`
#17386 closed
Aug 8, 2025 -
[Bug]: Problem deploying Gemma-3-27b with vLLM
#16380 closed
Aug 8, 2025 -
[Bug]: gemma 3 structured output api occurs assertion error
#15766 closed
Aug 8, 2025 -
[Bug]: vllm 0.8.3 serve error
#15457 closed
Aug 8, 2025 -
[Bug]: Gemma-3-27b-it-GPTQ Can't run in sm75, vllm-0.7.4.dev
#14814 closed
Aug 8, 2025 -
[Feature]: gemma3 raise error
#14723 closed
Aug 8, 2025 -
[Bug]: reasoning parser for Qwen/Qwen3-4B-Thinking-2507
#22507 closed
Aug 8, 2025 -
[Bug]: Startup hangs on V100
#22503 closed
Aug 8, 2025 -
[Bug]: OSS-20B [backend_xgrammar.py:160] Failed to advance FSM
#22483 closed
Aug 8, 2025 -
[Bug]: RuntimeError: Index put requires the source and destination dtypes match
#22064 closed
Aug 8, 2025 -
[Bug]: EPLB load statistics problem
#21883 closed
Aug 8, 2025 -
[Bug]: Not support MiniCPM-o 2.6 ‘s finetune lora
#13018 closed
Aug 8, 2025 -
[Feature]: deepseek-r1-w8a8
#14176 closed
Aug 8, 2025 -
[Bug]: Speculative decoding with a draft model makes generation slower
#15025 closed
Aug 8, 2025 -
DeciLMConfig object has no attribute ‘num_key_value_heads_per_layer’
#15625 closed
Aug 8, 2025 -
[Bug]: Compiling for CPU fails due to wrong python setup call
#15953 closed
Aug 8, 2025 -
[Usage]: Llama-3.1-8B-Instruct
#16207 closed
Aug 8, 2025 -
[New Model]: LGAI-EXAONE/EXAONE-Deep-7.8B GGUF
#16224 closed
Aug 8, 2025 -
[Bug]: (Maybe) Input preprocessing blocks the async operations
#17883 closed
Aug 8, 2025 -
[Usage]: ERROR:root:Compiled DAG task exited with exception
#16242 closed
Aug 8, 2025 -
[Usage]: Failed to get global TPU topology.
#16243 closed
Aug 8, 2025 -
[New Model]: efficient-speech/lite-whisper-large-v3
#16244 closed
Aug 8, 2025 -
[Usage]: Async generate with offline LLM interface
#16251 closed
Aug 8, 2025 -
[Usage]: How to use xPyD disaggregated prefilling
#16253 closed
Aug 8, 2025 -
[Feature]: Will you add padding for intermediate_size just like lmdeploy?
#16260 closed
Aug 8, 2025 -
[Feature]: ray logs too large
#16262 closed
Aug 8, 2025 -
[Bug]: invalid responses when generating yaml format
#16269 closed
Aug 8, 2025 -
[Usage]: Force output for log probabilities
#16281 closed
Aug 8, 2025 -
[Bug]: Deepseek R1 gives nonsensical tokens during offline Inference
#16285 closed
Aug 8, 2025 -
[Performance]: Performance degradation with tp=8 compared to tp=4 on 8xA100(80G)
#16300 closed
Aug 8, 2025 -
[Usage]: When performing inference with vLLM, it keeps getting stuck at 0%.
#16303 closed
Aug 8, 2025 -
[Usage]: OpenAI client to vllm with lora
#16304 closed
Aug 8, 2025 -
[Usage]: VocabParallelEmbedding raise error embedding: [-130., -130., .....]
#16305 closed
Aug 8, 2025 -
[Feature]: Add task perplexity mode to optimize PPL evaluation
#16324 closed
Aug 8, 2025 -
[Usage]: it Says not able to read the model architecture
#16326 closed
Aug 8, 2025 -
[Feature]: Soft Prompts?
#16351 closed
Aug 8, 2025 -
[Usage]: vllm serve: how to get KV cache percentage on CPU and GPU?
#16368 closed
Aug 8, 2025 -
[Bug]: vllm server bug with qwen 2.5 gguf
#16374 closed
Aug 8, 2025 -
[Bug]: ROCm NotImplementedError: Speculative decoding is not yet supported on vLLM V1
#21308 closed
Aug 8, 2025 -
[Bug]: Flashinfer 0.2.10 not supported
#22455 closed
Aug 7, 2025 -
[Feature]: Support attention backend with FlexAttention
#7315 closed
Aug 7, 2025 -
[Bug]: TypeError: flash_attn_varlen_func() got an unexpected keyword argument 'num_splits'
#20556 closed
Aug 7, 2025 -
[RFC]: How to use OpenAI's "Reasoning Levels" (low/medium/high) in vLLM?
#22359 closed
Aug 7, 2025 -
[Bug]: Incorrect version judgment method for flashinfer
#22297 closed
Aug 7, 2025 -
[feat] vLLM generation deterministic option/flag
#2910 closed
Aug 7, 2025 -
ExLlamaV2: exl2 support
#3203 closed
Aug 7, 2025 -
[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
#8024 closed
Aug 7, 2025 -
[Bug]: gpu_memory_utilization affects generation quality
#15763 closed
Aug 7, 2025 -
[Bug]: benchmark a vllm cluster error
#22351 closed
Aug 7, 2025 -
[Bug]: docker image vllm/vllm-openai:gptoss can not load gpt-oss-120b on H100. OutOfMemoryError
#22358 closed
Aug 7, 2025 -
[Bug]: gpt oss 401 unauthorized error
#22407 closed
Aug 6, 2025 -
[Bug]: ModelConfig validation error when serving chatgpt-oss models
#22301 closed
Aug 6, 2025 -
[Bug]:
#22304 closed
Aug 6, 2025 -
[Bug]:
#22303 closed
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP failed
#22306 closed
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP failed
#22305 closed
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP failed
#22302 closed
Aug 6, 2025 -
[Bug]: vllm will return empty string when request.stop is not null
#11089 closed
Aug 6, 2025 -
[RFC]: BatchLLM for better shared prefix utilizing in offline scenarios
#12080 closed
Aug 6, 2025 -
[Bug]: deepseek r1 + vllm (v0.7.2) torch.compile error
#13471 closed
Aug 6, 2025 -
[Feature]: DeepSeek v3/r1 MTP support PP
#14005 closed
Aug 6, 2025 -
[Bug][V1]: Loading Llama3.1-8B-INT8 gets OOM when using VLLM_USE_v1=1 but safe using v0
#14286 closed
Aug 6, 2025 -
[Bug]: Once again, there is an imbalance in memory usage in pipeline parallel deployment
#14763 closed
Aug 6, 2025 -
[Usage]: Clarification on how to use Greedy Search and then Beam search's Poor Performance in VLLM
#15146 closed
Aug 6, 2025 -
[Bug]: Missing `Xformers` on `rocm/vllm:rocm6.3.1_instinct_vllm0.7.3_20250311` for pixtral-12b-2409
#15972 closed
Aug 6, 2025 -
[Usage]: How to run multiple models in the same docker instance
#16065 closed
Aug 6, 2025 -
[Bug]: Llama4 does not follow basic instructions properly
#16125 closed
Aug 6, 2025 -
[Bug]: memory usage is greater than expected
#16184 closed
Aug 6, 2025 -
[Installation]: the --mount option requires BuildKit
#16205 closed
Aug 6, 2025 -
[Bug]: RMSNorm not checking for input shape in forward_cuda
#16329 closed
Aug 6, 2025 -
[CI Failure]: Plugin Tests (2 GPUs) - plugins_tests/test_platform_plugins.py::test_oot_attention_backend
#22285 closed
Aug 6, 2025 -
[Feature]: Support for Qwen/Qwen-Image
#22212 closed
Aug 5, 2025 -
[Bug]: when tensor-parallel-size>1,Stuck
#8087 closed
Aug 5, 2025 -
[Bug]: Can Not load model Qwen2-VL-72B-Instruct in Vllm
#11608 closed
Aug 5, 2025 -
[Bug]: vLLM gets stuck at pynccl.py during startup
#12292 closed
Aug 5, 2025 -
[Bug]: different logprobs for qwn2-vl when running on transformers and on vllm
#12699 closed
Aug 5, 2025 -
[V1] [Performance Benchmark] Benchmark the performance of Speculative Decoding
#15600 closed
Aug 5, 2025 -
[Bug]: OpenAI-Compatible Server cannot be requested
#15675 closed
Aug 5, 2025 -
[Bug]: Worker died during distributed inference
#15687 closed
Aug 5, 2025 -
[Bug]: Null response for Mistral3.1
#16014 closed
Aug 5, 2025 -
[Bug]: allow input token logprobs output for multimodal/VLLM
#16107 closed
Aug 5, 2025 -
[Usage]: Asking for help: vllm0.7.2 deploy DeepSeek-R1-int4-gptq-sym-inc
#16111 closed
Aug 5, 2025 -
[Feature]: Support loading GGUF from a remote repo directly
#22210 closed
Aug 4, 2025 -
[Doc]: docs opening speed is much slower than before
#22182 closed
Aug 4, 2025 -
[Bug]: vllm ignores existing pytorch installation
#21745 closed
Aug 4, 2025 -
[CI Failure]: Quantized Models Test - models/quantization/test_gguf.py::test_models
#22136 closed
Aug 4, 2025 -
[Bug]: can not find model error when use docker deploy
#22164 closed
Aug 4, 2025 -
[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication
#9519 closed
Aug 4, 2025 -
[Bug]: Issue with extra [TOOL_CALLS] prefix in function call outputs in chat_with_tools.py
#13473 closed
Aug 4, 2025 -
[Bug]: Gibberish Output from LLaMA 3.1 8B using vLLM with xGrammar
#13828 closed
Aug 4, 2025 -
[New Model]: No supported config format found in deepseek-vl2-small
#14105 closed
Aug 4, 2025 -
Does VLLM support structured pruning?
#15854 closed
Aug 4, 2025 -
[Bug]: v1 0.8.2 gets weird performance results
#15947 closed
Aug 4, 2025 -
[Bug]: TypeError: __init__() missing 1 required positional argument: 'inner_exception'
#16009 closed
Aug 4, 2025 -
[Doc]: Steps to run 2 different models on Kaggle GPUs using vllm
#16051 closed
Aug 4, 2025 -
[Bug]: CI flake - v1/engine/test_async_llm.py::test_abort - assert has_unfinished_requests()
#16054 closed
Aug 4, 2025 -
[Bug]: KeyError: 'local_attn_masks' on running gemma3 models with kv-cache quantization
#16061 closed
Aug 4, 2025 -
[Bug]: vllm server stops after torch.compile step for multi-gpu setup
#16092 closed
Aug 4, 2025 -
[Feature]: Will beam search utilize CUDA graphs to improve its speed in the future?
#16099 closed
Aug 4, 2025 -
[Feature]: wheel for vllm swiftkv
#16108 closed
Aug 4, 2025 -
[Doc]: Syntax error of example code in structured_outputs.md
#21914 closed
Aug 4, 2025
115 Issues opened by 109 people
-
[Feature]: gpt-oss tool parser
#22604 opened
Aug 10, 2025 -
[Bug]: CPU penalty operations fail on CUDA-capable systems
#22591 opened
Aug 10, 2025 -
[Bug]: ROCm build falls back to default arch despite ARG_PYTORCH_ROCM_ARCH set in Dockerfile.rocm
#22590 opened
Aug 10, 2025 -
[Bug]: [gpt-oss-120b] Chat Completions endpoint tool_call support is not working
#22578 opened
Aug 10, 2025 -
[Bug]: Vllm hangs when I use the offline engine with dp = 2 or more
#22575 opened
Aug 9, 2025 -
[Bug]: Llama3.1 8B failing to load, _tensor has no operation split, v0.9.2 & v0.10.0
#22572 opened
Aug 9, 2025 -
[Feature]: subfolder parameter for EngineArgs
#22569 opened
Aug 9, 2025 -
[Usage]: VRAM spike while loading gemma3-12b bnb on vllm-0.10
#22568 opened
Aug 9, 2025 -
[CI Failure]: Distributed Tests (2 GPUs) - Mllama TP=2 results divergence and deadlock issue
#22559 opened
Aug 9, 2025 -
[Usage]: vllm 0.10.1 gpt-oss version, will speculative sampling be used by default?
#22551 opened
Aug 8, 2025 -
[Bug]: Stats don't update to zero when all requests are aborted
#22545 opened
Aug 8, 2025 -
[Bug]: gpt-oss-20b flaky BadRequest 400
#22533 opened
Aug 8, 2025 -
[Bug]: NIXL disaggregation example does not work
#22532 opened
Aug 8, 2025 -
[Usage]: How to choose max_num_batched_tokens when chunked prefill is enabled
#22531 opened
Aug 8, 2025 -
[Bug]: (gpt-oss-20b) openai_harmony.HarmonyError: error downloading or loading vocab file
#22525 opened
Aug 8, 2025 -
[Usage]: Custom function-based stopping criteria
#22522 opened
Aug 8, 2025 -
[Bug]: [gpt oss 20b] [tool_call] Unexpected token 12606 while expecting start token 200006
#22519 opened
Aug 8, 2025 -
[Bug]: Fine-tuned DeepSeek-R1-Distill-Qwen-1.5B generates only exclamation marks (token 0) on Ascend NPU
#22518 opened
Aug 8, 2025 -
[Bug]: GPT-OSS 20b/120b [backend_xgrammar.py:160] Failed to advance FSM for request
#22513 opened
Aug 8, 2025 -
qwen2.5_omni_7b response audio
#22510 opened
Aug 8, 2025 -
[Usage]: vllm failed to run on multiple gpu
#22505 opened
Aug 8, 2025 -
is support for CohereLabs/command-a-vision-07-2025 available?
#22504 opened
Aug 8, 2025 -
[Bug]: openai/gpt-oss-120b can't run on A100
#22502 opened
Aug 8, 2025 -
[Usage]: Running a 300-400B Parameter Model on Multi-Node Setup (2x 8xA100)
#22501 opened
Aug 8, 2025 -
[Bug]: Crashed when loading ggml quantized Gemma3
#22497 opened
Aug 8, 2025 -
[RFC]: Should the gpt-oss reasoning parser use harmony directly?
#22493 opened
Aug 8, 2025 -
[Bug]: HF_HUB_OFFLINE Parameter does not take effect
#22492 opened
Aug 8, 2025 -
[Bug]: With the same model qwen2.5-72b-int4, outputs differ between single-GPU and dual-GPU runs
#22484 opened
Aug 8, 2025 -
[Feature]: Add Moving Average Statistics for Better Performance Monitoring
#22480 opened
Aug 8, 2025 -
[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10
#22479 opened
Aug 8, 2025 -
[Bug]: Trailing newline in prompt affects output
#22477 opened
Aug 8, 2025 -
[Performance]: Very low prefill speed on CPU with BART (encoder-decoder)
#22472 opened
Aug 7, 2025 -
[Bug]: gpt oss 20/120b generates weird characters and fails later when I use them
#22470 opened
Aug 7, 2025 -
[Usage]: can't seem to use batching correctly
#22465 opened
Aug 7, 2025 -
[Feature]: Please add support for Amd Instinct MI50
#22458 opened
Aug 7, 2025 -
[Bug]: Low time to first token of prefill and decode instances but high TTFT with 1p1d
#22454 opened
Aug 7, 2025 -
[Bug]: Qwen3 V1 custom rotary embedding ops breaks torch compile graph
#22453 opened
Aug 7, 2025 -
[Doc]: Data Parallel script gives `RuntimeError: NCCL error`
#22452 opened
Aug 7, 2025 -
[Usage]: Can runtime static values be retrieved for `AsyncLLM`
#22447 opened
Aug 7, 2025 -
[Feature]: Support Qwen model on AWS Neuron
#22445 opened
Aug 7, 2025 -
[Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings
#22444 opened
Aug 7, 2025 -
[Usage]: Whether to support benchmark service stress testing of embeddings model?
#22442 opened
Aug 7, 2025 -
[Feature]: Support Eagle Draft Model with different number of KV heads
#22432 opened
Aug 7, 2025 -
[Bug]: PP+PD NixlConnector failed
#22430 opened
Aug 7, 2025 -
[Bug]: Voxtral-Small-24B-2507 Does Not Support Pipeline-Parallel
#22424 opened
Aug 7, 2025 -
[Feature]: mxfp4 support for 3090
#22422 opened
Aug 7, 2025 -
[Feature]: Add BF16/U8 support for Apple silicon.
#22420 opened
Aug 7, 2025 -
[Bug]: Potential Integer Overflow in Operators
#22419 opened
Aug 7, 2025 -
[Bug]: mistralai/Mixtral-8x7B-Instruct-v0.1 will not load from local path
#22416 opened
Aug 7, 2025 -
[Performance]: n-gram speculative decoding drafting optimizations
#22408 opened
Aug 6, 2025 -
[Bug]: For GPT OSS 120B: Expected 2 output messages (reasoning and final), but got 7.
#22403 opened
Aug 6, 2025 -
[CI Failure]: Blackwell Test - toomanyrequests: Data limit exceeded
#22396 opened
Aug 6, 2025 -
[Bug]: TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'sinks'
#22383 opened
Aug 6, 2025 -
[Feature]: serve gpt-oss BF16 weights
#22380 opened
Aug 6, 2025 -
[Bug]: Model architectures ['GptOssForCausalLM'] failed to be inspected.
#22376 opened
Aug 6, 2025 -
[Bug]: process audios in pass audio in video with qwen2.5-omni-7b
#22364 opened
Aug 6, 2025 -
[Bug]: gpt-oss-120b does not support hybrid data parallel and tensor parallel
#22361 opened
Aug 6, 2025 -
[Bug]: gpt-oss-20b, oss architecture not recognised with transformers
#22353 opened
Aug 6, 2025 -
[Usage]: gpt-oss-120b tool calls
#22337 opened
Aug 6, 2025 -
[Feature]: add --reasoning_parser flag for gpt-oss
#22334 opened
Aug 6, 2025 -
[Bug]: vllm/vllm-openai:gptoss AssertionError: Sinks are only supported in FlashAttention 3 (4090 48gb)
#22331 opened
Aug 6, 2025 -
[Bug]: gpt-oss model crashes on NVIDIA B200 with any OpenAI chat completion request
#22325 opened
Aug 6, 2025 -
[Bug]: Error using openai's json_schema
#22324 opened
Aug 6, 2025 -
[Bug]: Qwen3-30B-A3B-Instruct-2507-FP deployment failed
#22307 opened
Aug 6, 2025 -
[New Model]: https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B
#22298 opened
Aug 6, 2025 -
[Feature]: EP physical expert load write metrics
#22296 opened
Aug 6, 2025 -
[Feature]: Cleanup NIXLConnector Connections When Nodes Spin Down
#22295 opened
Aug 6, 2025 -
[Feature]: Tune Triton Configs for Qwen3-30B-A3-Fp8 and Bf16
#22294 opened
Aug 6, 2025 -
[Feature]: Optimize RoPE
#22293 opened
Aug 6, 2025 -
[Feature]: Make KVConnector Compatible with HMA
#22292 opened
Aug 6, 2025 -
[Bug]: gpt-oss on Ampere
#22290 opened
Aug 6, 2025 -
[Feature]: MoE DP support for AWQ/GPTQ
#22289 opened
Aug 6, 2025 -
[Bug]: The quantization method mxfp4 is not supported for the current GPU SM75
#22288 opened
Aug 5, 2025 -
[Bug]: gpt-oss-20b sometimes emits reserved tokens
#22287 opened
Aug 5, 2025 -
[Bug]: Unknown quantization method: mxfp4
#22276 opened
Aug 5, 2025 -
[Bug]: pass audios in videos in qwen2.5-omni-7b
#22268 opened
Aug 5, 2025 -
[Bug]: Running gpt-oss-120b encounteres an error with huggingface page setup
#22266 opened
Aug 5, 2025 -
[New Model]: OpenAI OSS model support
#22265 opened
Aug 5, 2025 -
[Bug]: I got an `torch._scaled_mm` error using async tp with Ampere GPU
#22250 opened
Aug 5, 2025 -
[Bug]: qwen3 moe fp8 perchannel compressed-tensors model cannot infer
#22249 opened
Aug 5, 2025 -
[RFC]: Kernel Bench Docker
#22247 opened
Aug 5, 2025 -
[RFC]: Dynamic Expert Load Balance with Zero-like-overhead
#22246 opened
Aug 5, 2025 -
[Performance]: vllm v0.10.0 seems to be much slower than vllm v0.8.5 when using Qwen3-30B-A3B-int4
#22239 opened
Aug 5, 2025 -
[Bug]: DeepSeekV3 type model fails with UnboundLocalError: cannot access local variable 'shared_output'
#22232 opened
Aug 5, 2025 -
[Usage]: Model architectures ['Qwen3ForCausalLM'] are not supported for now.
#22231 opened
Aug 5, 2025 -
[Feature]: Can speculative decoding and prefix caching take effect simultaneously?
#22230 opened
Aug 5, 2025 -
[Feature]: Qwen/Qwen3-4B-AWQ or RedHatAI/Qwen3-4B-quantized.w4a16 CPU support
#22213 opened
Aug 4, 2025 -
[Feature]: consider offering a linux/arm64 build on Docker Hub
#22206 opened
Aug 4, 2025 -
[RFC]: Integrate MPK (Mirage) compiler as an experimental execution backend to vLLM
#22201 opened
Aug 4, 2025 -
[Feature]: Support return partial rollout when abort request
#22197 opened
Aug 4, 2025 -
[Feature]: Support Mixed-Precision KV Cache Configuration
#22195 opened
Aug 4, 2025 -
[Feature]: Deterministic inference between vllm versions
#22191 opened
Aug 4, 2025 -
[Usage]: VLLM_USE_TRITON_FLASH_ATTN=0 does not enable CK flash attention
#22190 opened
Aug 4, 2025 -
[Feature]: support moe model in small gpu ram
#22189 opened
Aug 4, 2025 -
[Bug]: downgrade the wrong platform of pytorch version, should be ppc instead aarch64
#22183 opened
Aug 4, 2025 -
[Bug]: apply_temperature may cause nan in probs
#22180 opened
Aug 4, 2025 -
[Installation]: vllm 0.90 shows aimv2 error code need expert help
#22178 opened
Aug 4, 2025 -
[Usage]: Abnormal GPU usage for FP8 models on Ampere GPUs
#22177 opened
Aug 4, 2025 -
[Usage]: multi-lora for vision language model
#22162 opened
Aug 4, 2025 -
[Performance]: TTFT
#22161 opened
Aug 4, 2025
360 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Model] Pooling models default to using chunked prefill & prefix caching if supported.
#20930 commented on
Aug 10, 2025 • 44 new comments -
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel
#21716 commented on
Aug 9, 2025 • 39 new comments -
v1: Add Whisper model support (encoder-decoder)
#21088 commented on
Aug 10, 2025 • 19 new comments -
[Core] Shared memory based object store for Multimodal data caching and IPC
#20452 commented on
Aug 9, 2025 • 17 new comments -
[V1] Logits processors extensibility
#19912 commented on
Aug 7, 2025 • 14 new comments -
LFM2
#20797 commented on
Aug 9, 2025 • 13 new comments -
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods.
#22035 commented on
Aug 10, 2025 • 10 new comments -
Update PyTorch to 2.8.0
#20358 commented on
Aug 10, 2025 • 9 new comments -
[Feature] use --eplb_config to set eplb param
#20562 commented on
Aug 9, 2025 • 9 new comments -
[Bugfix] Fix hermes tool parser handling of non-string argument types
#22002 commented on
Aug 6, 2025 • 8 new comments -
Support token_type_ids in V1 with less code changes
#21985 commented on
Aug 10, 2025 • 8 new comments -
[Model] Mamba2 varlen and metadata refactor
#21467 commented on
Aug 4, 2025 • 8 new comments -
[Speculators][Speculative Decoding] Add Eagle3 Support For HunYuan Model
#22080 commented on
Aug 9, 2025 • 7 new comments -
feat: update flashinfer ar oneshot params
#22108 commented on
Aug 10, 2025 • 6 new comments -
[Doc]: improve CPU(x86) build instructions and fix include path
#19156 commented on
Aug 10, 2025 • 6 new comments -
v1: Add Request.block_hashes
#19728 commented on
Aug 10, 2025 • 6 new comments -
[V1] feat:add engine v1 tracing
#20372 commented on
Aug 7, 2025 • 5 new comments -
[Feature] limit thinking tokens
#20859 commented on
Aug 5, 2025 • 5 new comments -
fix(completion): always include usage
#20983 commented on
Aug 6, 2025 • 5 new comments -
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer
#20059 commented on
Aug 10, 2025 • 5 new comments -
[Bugfix] Add Dense module support for sentence-transformers models
#22117 commented on
Aug 5, 2025 • 4 new comments -
Fix error message for max_input_length (bugfix of #22092)
#22094 commented on
Aug 5, 2025 • 4 new comments -
Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_convert"
#21888 commented on
Aug 7, 2025 • 4 new comments -
[Feature] [V1] intermediate logging
#21215 commented on
Aug 8, 2025 • 3 new comments -
fix: NIXL connector transfers partial block to pass full multi-modal context
#21074 commented on
Aug 8, 2025 • 3 new comments -
[ROCm] Auto-Select Attention Backend
#21366 commented on
Aug 7, 2025 • 3 new comments -
Add support for model signature verification
#21957 commented on
Aug 7, 2025 • 3 new comments -
[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper
#19983 commented on
Aug 10, 2025 • 3 new comments -
Migrate MiniCPMVImageInputs to TensorSchema
#21939 commented on
Aug 10, 2025 • 3 new comments -
[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2
#21783 commented on
Aug 8, 2025 • 2 new comments -
Draft: Qwen2.5 VL eagle3
#22029 commented on
Aug 10, 2025 • 2 new comments -
[V1][Hybrid] Make KV cache layout of triton_attn compatible with hybrid models
#21624 commented on
Aug 5, 2025 • 2 new comments -
[1/N] Refactor platform API to reduce `torch.cuda` call
#20751 commented on
Aug 10, 2025 • 2 new comments -
[docs] Add FlashInfer installation guidance with torch 2.7 + cuda 12.8
#21890 commented on
Aug 5, 2025 • 1 new comment -
[Core] Hidden State Processors via plugins
#21621 commented on
Aug 7, 2025 • 1 new comment -
Migrate LlavaNextVideoPixelInputs to TensorSchema
#21843 commented on
Aug 10, 2025 • 1 new comment -
v1: Support KV events from connectors
#19737 commented on
Aug 10, 2025 • 1 new comment -
Migrate MiniMaxVL01ImageInputs to TensorSchema
#21940 commented on
Aug 7, 2025 • 1 new comment -
Add Tool Call Parser for tngtech/DeepSeek-TNG-R1T2-Chimera
#22074 commented on
Aug 4, 2025 • 1 new comment -
Qwen FP8 ModelOPT support
#21978 commented on
Aug 8, 2025 • 1 new comment -
[V1] support min_tokens for detokener
#22014 commented on
Aug 6, 2025 • 1 new comment -
[Bugfix]: Fix Promethus spec decode counter sum-of-sums
#15415 commented on
Aug 10, 2025 • 0 new comments -
[Frontend] [Bugfix] Refactor tool parsers and simplify the tool parsing interface.
#16096 commented on
Aug 9, 2025 • 0 new comments -
Switch from input to total toks/s in LLM est speed
#12908 commented on
Aug 8, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Aug 10, 2025 • 0 new comments -
Support R-KV Cache Compression in vLLM
#16160 commented on
Aug 8, 2025 • 0 new comments -
[P/D Disaggregation] `PDController` and `PDWorker` Prototype (1p1d)
#15343 commented on
Aug 4, 2025 • 0 new comments -
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
#15246 commented on
Aug 10, 2025 • 0 new comments -
[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898)
#12802 commented on
Aug 10, 2025 • 0 new comments -
[Bugfix] Mistral tool parser streaming update
#19425 commented on
Aug 8, 2025 • 0 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Aug 7, 2025 • 0 new comments -
[Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate` and `LLM.embed`
#18800 commented on
Aug 4, 2025 • 0 new comments -
[Bugfix] fix check kv cache memory log info
#17959 commented on
Aug 10, 2025 • 0 new comments -
[Misc][RFC] Add automated profiling sweep and heatmap visualization tools
#17933 commented on
Aug 10, 2025 • 0 new comments -
measure peak memory correctly by removing already used memory
#17872 commented on
Aug 10, 2025 • 0 new comments -
Fix NoFreeBlocksError
#17834 commented on
Aug 8, 2025 • 0 new comments -
cmake: Get rid of VLLM_PYTHON_EXECUTABLE
#17830 commented on
Aug 10, 2025 • 0 new comments -
[misc] helper for observability config
#17809 commented on
Aug 10, 2025 • 0 new comments -
Allow MambaCacheManager to use device types other than CUDA
#17779 commented on
Aug 8, 2025 • 0 new comments -
[Security] Document StatelessProcessGroup security concerns
#17591 commented on
Aug 5, 2025 • 0 new comments -
[Misc] add get kv cache token capacity
#17538 commented on
Aug 7, 2025 • 0 new comments -
[BUG] fix asymmetric `add_num_batched_tokens` and `subtract_num_batched_tokens`
#17436 commented on
Aug 5, 2025 • 0 new comments -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 commented on
Aug 5, 2025 • 0 new comments -
[Frontend] Added support for HermesToolParser for models without special tokens
#16890 commented on
Aug 4, 2025 • 0 new comments -
[Bugfix] Move current_platform import to avoid python import cache.
#16601 commented on
Aug 7, 2025 • 0 new comments -
[V1][Spec Decode] Add random seed for EAGLE and its test script
#16235 commented on
Aug 8, 2025 • 0 new comments -
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform
#12695 commented on
Aug 10, 2025 • 0 new comments -
[RFC]: Add automated profiling sweep and heatmap visualization tools
#17823 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Add Native Daemon Mode Support for `vllm serve`
#17847 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Token usage is unavailable in Qwen3 streaming thought mode.
#17942 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: hope that xgrammar and vLLM v1 can offer significant inference acceleration on the RTX 4090 as well
#18517 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
#17569 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: Error of running examples/offline_inference/multilora_inference.py using vllm v0.8.3
#16576 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: [Performance] 100% performance drop using multiple lora vs no lora(qwen-chat model)
#9496 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: vllm.core.block.interfaces.BlockAllocator.NoFreeBlocksError to old Mistral Model
#11168 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: GPU OOM when processing AWQ Marlin weights with UVA CPU Offload
#21864 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: Vllm0.6.2 UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
#8933 commented on
Aug 9, 2025 • 0 new comments -
[Performance]: Unexpected: B200 GPU Performance Similar to H200 for Qwen/QwQ-32B, Expected B200 to be Significantly Faster
#18725 commented on
Aug 9, 2025 • 0 new comments -
[Bug]: Incremental detokenization error when running `llama-3.3-70b-fp8` model
#21951 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Aug 8, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Aug 8, 2025 • 0 new comments -
[RFC]: vLLM configuration refactoring and modularization
#18953 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Stuck request and empty streaming for gemma3 serving with ^v0.8.5
#17658 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Inference fails on Apple silicon due to (distributed) networking error?
#18362 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Sampling discrepancy between ollama and vLLM for gemma-3-27b-it et al.
#20060 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Gemma3 reporting low image accuracy with v1 engine
#19763 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: sm75 can not serve qwen3 bnb 4bit model
#17337 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Prefix caching ignores visual input, causing incorrect multimodal outputs under concurrency
#20261 commented on
Aug 8, 2025 • 0 new comments -
[CI] Fix flaky CI test
#12626 commented on
Aug 5, 2025 • 0 new comments -
[Frontend] Add segments to OpenAI Requests
#11713 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Add LoRA adapter support for Gemma3nForConditionalGeneration models
#21746 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Support Anthropic API `/v1/messages` endpoint
#21313 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Unexpected behavior of `returned_token_ids` in Reward Modeling for LlamaForCausalLM
#16545 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Aug 10, 2025 • 0 new comments -
[Installation]: Latest vLLM source installs successfully, but the vLLM server fails to run
#22008 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Support for diffusion LLM models like LLADA
#18532 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: the throughput of qwen3moe is low for prompts above 2000 tokens
#17650 commented on
Aug 10, 2025 • 0 new comments -
[Feature]: Use `QuantFp8` `CustomOp`-abstraction for MoE layers
#20711 commented on
Aug 10, 2025 • 0 new comments -
[Performance]: Inefficient prefill attention compared to HuggingFace
#20174 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: The value of --max-model-len may influence results even when the input length is less than max-model-len
#11447 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: v0.7.3 can't work on wsl-ubuntu mirrored network
#13656 commented on
Aug 10, 2025 • 0 new comments -
[New Model]: Google SigLip 2
#13663 commented on
Aug 10, 2025 • 0 new comments -
[Installation]: Attempting to build and run vLLM for Intel Core Ultra 7 155H with ARC iGPU
#14295 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: 0.8.0 (V1) Ray cannot find the pyarrow and pandas modules
#15100 commented on
Aug 10, 2025 • 0 new comments -
[Usage]: How to make DeepSeek output its reasoning normally while producing structured output for the final content
#15618 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Using TP = 16 to serve DeepSeek-V3 on 2*H20 in a Ray cluster raises an EngineCore exception
#16646 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: CPU memory not released when waking up the vLLM instance
#16663 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: Unable to run Qwen3 on Turing GPUs after upgrading to torch 2.7.0
#17639 commented on
Aug 10, 2025 • 0 new comments -
[Bug]: KeyError: 'layers.11.shared_transformer.self_attn.qkv_proj.weight' for Zamba2 after finetuning
#17755 commented on
Aug 10, 2025 • 0 new comments -
Migrate Mistral3ImagePixelInputs to TensorSchema
#21945 commented on
Aug 7, 2025 • 0 new comments -
[Bugfix] Ensure the system ulimit is high enough to support the concurrency in benchmark
#21938 commented on
Aug 4, 2025 • 0 new comments -
Fix #21840. Fix tool_call parsing edge cases
#21930 commented on
Aug 4, 2025 • 0 new comments -
[Bugfix] Fix port conflict by obtaining a list of open ports upfront
#21894 commented on
Aug 8, 2025 • 0 new comments -
[Bugfix] Fix PyNcclCommunicator device assertion for un-indexed CUDA devices
#21869 commented on
Aug 5, 2025 • 0 new comments -
[V0 Deprecation] [P/D] Move `kv_connector/v1` to `kv_connector` (2/2)
#21855 commented on
Aug 5, 2025 • 0 new comments -
[WIP] Add Kimi-Audio integration for vLLM
#21849 commented on
Aug 8, 2025 • 0 new comments -
Migrate MiniCPMOAudioInputs to TensorSchema
#21847 commented on
Aug 7, 2025 • 0 new comments -
Migrate LlavaOnevisionMultiInputs to TensorSchema
#21844 commented on
Aug 9, 2025 • 0 new comments -
Fix kvcache mismatch issue in vllm v0 kv_connector
#21817 commented on
Aug 5, 2025 • 0 new comments -
[Bugfix] Scheduler: only schedule prefill chunks when entire context fits
#21809 commented on
Aug 7, 2025 • 0 new comments -
Migrate LlavaImageInputs to TensorSchema
#21770 commented on
Aug 10, 2025 • 0 new comments -
[Doc] Added unmentioned required option "method" in the usage of EAGLE-3 based models
#21737 commented on
Aug 7, 2025 • 0 new comments -
[EP]dynamic Eplb metrics
#21732 commented on
Aug 4, 2025 • 0 new comments -
Limit concurrent long partial prefills via max_long_partial_prefills
#21651 commented on
Aug 8, 2025 • 0 new comments -
[V1] [Kernel] Change KV cache layout to (num_blocks, 2, ...) for FlashAttention backend
#21549 commented on
Aug 10, 2025 • 0 new comments -
[Perf] Support silu_and_mul vectorization
#21521 commented on
Aug 4, 2025 • 0 new comments -
[Benchmark] Add expert parallel support to MoE benchmark
#20876 commented on
Aug 4, 2025 • 0 new comments -
[V0 Deprecation] Remove multi-step scheduling
#22138 commented on
Aug 9, 2025 • 0 new comments -
Update rms_norm_kernel by removing redundant global memory loads
#22134 commented on
Aug 10, 2025 • 0 new comments -
[Bugfix] Add num_special_tokens_to_add to MistralTokenizer, fixes #22013
#22121 commented on
Aug 4, 2025 • 0 new comments -
vLLM Benchmark suite improvement
#22119 commented on
Aug 9, 2025 • 0 new comments -
[Fix] Fix python path resolving in cpu cmake
#22115 commented on
Aug 6, 2025 • 0 new comments -
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar
#22112 commented on
Aug 6, 2025 • 0 new comments -
enable Docker-aware precompiled wheel setup
#22106 commented on
Aug 6, 2025 • 0 new comments -
Enable EPLB on ernie4.5-moe
#22100 commented on
Aug 5, 2025 • 0 new comments -
[V1] Enhanced Exception Handling for KV Cache Loading from Remote Store
#22075 commented on
Aug 4, 2025 • 0 new comments -
[Core] Enable HF processing on GPU
#22070 commented on
Aug 9, 2025 • 0 new comments -
Update FlashAttention to latest commit
#22030 commented on
Aug 7, 2025 • 0 new comments -
[BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm
#22017 commented on
Aug 9, 2025 • 0 new comments -
[Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils
#21999 commented on
Aug 8, 2025 • 0 new comments -
[Disagg][Perf] Add env var to allow gpu model work runs in non-default CUDA stream, improving disagg TTIT/TTFT
#21988 commented on
Aug 7, 2025 • 0 new comments -
[torchao] Support quantization configs using module swap
#21982 commented on
Aug 9, 2025 • 0 new comments -
[CI/Build] add EP dependencies to docker
#21976 commented on
Aug 7, 2025 • 0 new comments -
[Build] Add FlashInfer wheel build to release pipeline
#21975 commented on
Aug 6, 2025 • 0 new comments -
[Feature] Add `VLLM_USE_DEEP_GEMM_E8M0` Env to Control E8M0 Scale
#21968 commented on
Aug 10, 2025 • 0 new comments -
[Frontend] OpenAI Responses API supports Tool/Function calling
#20874 commented on
Aug 7, 2025 • 0 new comments -
[Bugfix] Fix the bug in Hermes streaming parsing
#20824 commented on
Aug 5, 2025 • 0 new comments -
[WIP] [Feature]: LoRA for vision modules
#20787 commented on
Aug 4, 2025 • 0 new comments -
[PERF] Symmetric memory allreduce
#20759 commented on
Aug 7, 2025 • 0 new comments -
[feat] backup memory except model parameters when using level=2 in sleep mode
#20735 commented on
Aug 4, 2025 • 0 new comments -
PrefixRepetitionRandomDataset
#20638 commented on
Aug 8, 2025 • 0 new comments -
feat: Add streaming support for Mistral v11 tool format
#20503 commented on
Aug 6, 2025 • 0 new comments -
[Installation] Fix python only installation wheel packaging missing libs
#20351 commented on
Aug 9, 2025 • 0 new comments -
[Do not merge] Add out of place layernorm
#20197 commented on
Aug 7, 2025 • 0 new comments -
v1: Introduce LRU-based CPU offloading management
#20075 commented on
Aug 5, 2025 • 0 new comments -
Add support for token_type_ids
#19988 commented on
Aug 5, 2025 • 0 new comments -
[Core] Track expert selection metrics
#19915 commented on
Aug 4, 2025 • 0 new comments -
v1: Introduce an offloading component
#19848 commented on
Aug 4, 2025 • 0 new comments -
Triton-fused DeepseekScalingRotaryEmbedding
#19771 commented on
Aug 8, 2025 • 0 new comments -
Workaround for an integer overflow with large CHUNK_SIZE
#19770 commented on
Aug 8, 2025 • 0 new comments -
BLOCK_SIZE_K fix
#19769 commented on
Aug 8, 2025 • 0 new comments -
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 commented on
Aug 8, 2025 • 0 new comments -
[V1][SpecDecode]Support relaxed acceptance for thinking tokens in speculative decoding in V1
#21506 commented on
Aug 10, 2025 • 0 new comments -
[v1][spec decode] Run eagle with full cudagraph support
#21477 commented on
Aug 10, 2025 • 0 new comments -
v1/offloading: Add worker-side CPU support
#21448 commented on
Aug 6, 2025 • 0 new comments -
Updates to Flex + vLLM integration
#21416 commented on
Aug 8, 2025 • 0 new comments -
[wip] add nccl allocator and symm memory and enable TP all reduce for nccl symm
#21383 commented on
Aug 5, 2025 • 0 new comments -
[feat] Support EAGLE for Qwen2
#21363 commented on
Aug 4, 2025 • 0 new comments -
[Feature][EPLB] Add support for Qwen3 EPLB
#21290 commented on
Aug 5, 2025 • 0 new comments -
[Fix] correct tool_id for kimi-k2 when use tool_choice=required
#21259 commented on
Aug 7, 2025 • 0 new comments -
Make async scheduling compatible with DP
#21244 commented on
Aug 8, 2025 • 0 new comments -
ci: Add CUDA + arm64 release builds
#21201 commented on
Aug 6, 2025 • 0 new comments -
[Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel
#21197 commented on
Aug 4, 2025 • 0 new comments -
[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4
#21166 commented on
Aug 7, 2025 • 0 new comments -
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing
#21161 commented on
Aug 4, 2025 • 0 new comments -
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration
#21078 commented on
Aug 7, 2025 • 0 new comments -
Enable sequence parallelism for full cuda graph without specifying compile sizes
#21031 commented on
Aug 10, 2025 • 0 new comments -
[Meta] Unshift eagle prefill and support draft kv sharing from base
#21008 commented on
Aug 8, 2025 • 0 new comments -
fix: Handle unsupported message fields in tool calling
#20973 commented on
Aug 4, 2025 • 0 new comments -
Add add_logger API to AsyncLLM
#20952 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Large Data Parallel Size Cause Loading Safetensors Extremely Slow
#17783 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to reproduce the results of `vllm` using `transformers`
#21433 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: ERNIE-4.5 does not run on an RTX Pro 6000 Blackwell
#20712 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: When inferring Qwen3-32B-AWQ with vllm0.9.2, an error message appears: Quantization scheme is not supported
#20216 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: return graceful inference text input validation errors as part of output (without throwing an exception) - to enable skipping / handling bad examples after the processing of good ones
#16732 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: try to gracefully destroy process group in `vllm serve` on handling Ctrl+C (prior to processes termination)
#19196 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: FP8 model crashes with EngineDeadError and CUDA illegal memory access on H100 (CUDA 12.8)
#21466 commented on
Aug 5, 2025 • 0 new comments -
[Installation]: Docker image build fails for Apple Silicon using Dockerfile.cpu
#21714 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Aug 5, 2025 • 0 new comments -
[RFC] Run HF processing on GPU
#21995 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: GLM-4.5 not working
#22140 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Support EPLB for More MoE Models, e.g. Qwen 3, Llama 4
#20468 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: How to use vLLM to accelerate text classification with the Qwen3ForSequenceClassification model?
#19950 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model.
#21986 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: LoRA support for qwen2-vl Models
#11255 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: QTIP Quantization
#11416 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Automatic Prefix Cache life cycle
#12077 commented on
Aug 5, 2025 • 0 new comments -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: ModuleNotFoundError: No module named 'pyarrow" in main branch
#14487 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Segmentation Fault caused by model indexing errors (token sequence length exceeding 16384) in vLLM 0.7.3 multi-node deployment for DeepSeek R1 67B
#14652 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: v0.8.1 V1 with pipeline-parallel-size 4, weird responses
#16068 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Problems with vllm serve DeepSeek-R1 with 2 nodes and TP = 16 (includes vLLM v0.8.4, v0.7.3, v0.7.2; V0 and V1 engines)
#16692 commented on
Aug 5, 2025 • 0 new comments -
[Doc]: state requirements for testing or update to work for CPU-only
#16920 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Is it possible to use CUDA Graph during the encoding for encoder-decoder models?
#17789 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Self-deployed vLLM cannot call tools; enabling --enable-auto-tool-choice then prompts to configure --chat-template-content-format, and finally errors out
#17792 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to output metrics information from vllm?
#17795 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: how to return attention_weight logits in page_attention
#17796 commented on
Aug 6, 2025 • 0 new comments -
[Installation]: How to deploy docling model on vllm
#17807 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Disaggregated Prefill in vLLM 0.8.3 Produces Incorrect/Unreasonable Outputs
#17808 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Deploy EasyOCR , Docling models on vllm
#17814 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: vllm 0.8.5.dev468+g98834fefa.precompiled OOM on Qwen3-32B with 1 lora module
#17822 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Enabling EPLB leads to inconsistent inference results
#21606 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Dynamic loading LoRA is not working properly
#18372 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: tool call parameters doesn't respect schema for parameters when Streaming=True and tool_choice="auto"
#21756 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Usage of VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 in V1 likely to cause a crash
#17924 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: [P/D] Expose kv_transfer metrics (print to console, and to Prometheus)
#21784 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: The class UnquantizedLinearMethod must implement the 'embedding' method
#22111 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: When running phi-4-reasoning-plus with vLLM, the model gets stuck repeating reasoning phrases
#18141 commented on
Aug 5, 2025 • 0 new comments -
[RFC]: Optimize Input Media Processing in vLLM
#22044 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Tool call argument value of type `integer` may break things when `stream=True`
#21372 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: Vllm whisper model response_format verbose_json not working
#14818 commented on
Aug 5, 2025 • 0 new comments -
[RFC]: vLLM-compile (minus cudagraphs) warm-start time should be close to zero
#20402 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#21661 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: swap_blocks and copy_blocks functions are wrong in flashinfer.py
#17362 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Expected there to be 4 prompt updates corresponding to 4 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs
#15338 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Facing run time error after building vllm cpu from source
#21935 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: ValueError: There is no module or parameter named 'lm_head' in Gemma3nForConditionalGeneration
#21755 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Single-Node EP Inference Failure on DeepSeek with PPLX/DeepGEMM Backend
#22039 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_0': 1}
#21882 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: An error occurred when using Eagle3 to load the Qwen3 series.
#22152 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Processor mismatch between what is provided by OpenGVLab and VLLM for InternVL leading to outputs of the processor being too large to be decoded for the tokenizer
#21899 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL + LoRA returns different results for same input on H20
#22057 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Multimodal Benchmarking Support (MMLM)
#21887 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: MoE models fail at startup: AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
#18967 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: vllm.LLM does not seem to re-initialize for distributed inference with subsequent models with Offline Inference
#9727 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Tensor-parallel offline inference errors with CalledProcessError: Command '['/usr/bin/gcc'....] returned non-zero exit status 1.
#15013 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made
#15483 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Using the latest version of the inference model, API calls report errors (V0.8.5)
#17430 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: failed to run LMCache example for v0
#17545 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: content is null when using "chat_template_kwargs": {"enable_thinking": false} in the request
#17609 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-7B gets stuck after loading weights and uses a lot of shared GPU memory
#17611 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: How to enable an LLM to simultaneously provide OpenAI API-compatible /v1/completions and /v1/embeddings services
#17627 commented on
Aug 4, 2025 • 0 new comments -
[Usage]: vLLM on multiple node GPUs
#17645 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Support for streaming N tokens at a time in AsyncLLMEngine
#17681 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: On A800 GPU with VLLM_USE_V1=1: ValueError: No available memory for the cache blocks
#17431 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: support HTTP/2.0?
#17695 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Required field "pixel_values" missing for Qwen2-VL
#17696 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Addition of pre-built AMD wheel packages
#17697 commented on
Aug 5, 2025 • 0 new comments -
[Usage]: How to limit the thinking budget for reasoning mode
#17700 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: The v1 engine does not support `add_logger`.
#17702 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Qwen3-30B-A3B-FP8 fails to run on 2*3090
#17708 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Slight Embedding Precision Difference When Running bge-m3 in vLLM Compared to Original Model
#17713 commented on
Aug 5, 2025 • 0 new comments -
[RFC]: Enabling Arm Neoverse CI Runners
#17720 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Does vLLM allow 'dropping' requests instead of preempting them?
#17736 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Interrupting inference with ctrl-c causes future requests to hang
#17738 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: token_type_ids lost from prompt input during asynchronous request processing
#17743 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: vllm==0.10.0 + flashinfer, MultiLevelCascadeAttentionWrapper.plan() got an unexpected keyword argument 'kv_data_type'
#21822 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Mistral Small 3.2 doesn't work with images
#20025 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: Engine Core initialization failed. See root cause above
#17618 commented on
Aug 5, 2025 • 0 new comments -
[Feature]: Audit and Update Examples To Use `VLLM_USE_V1=1`
#14530 commented on
Aug 5, 2025 • 0 new comments -
[Bug]: [v1/core/block_pool.py] Assertion Failure: prev_block.block_hash is not None
#21992 commented on
Aug 4, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Aug 4, 2025 • 0 new comments -
[Usage]: Kubernetes Offline Model Usage
#22071 commented on
Aug 4, 2025 • 0 new comments -
[Feature]: Simple Data Parallelism in vLLM
#9206 commented on
Aug 4, 2025 • 0 new comments -
[RFC][Feature]: Unified Auto-Selection Mechanism for Attention Backends
#21805 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Verify that the `min_tokens` sampling parameter is working and covered by CI tests
#21950 commented on
Aug 4, 2025 • 0 new comments -
[Bug]: Error when loading EAGLE3 weights, yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
#19991 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Offline inference data parallel significantly slower in 0.8.2 than 0.6.4.post1 and 0.7.2
#17685 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Qwen3 30b a3b awq not working with vllm docker v0.8.5.post1
#17739 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vLLM breaks when sent a low-resolution picture
#17769 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: Inquiry About AMD APU Support (e.g., AMD AI Max+ 395) and Handling in vLLM
#17843 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Can't use GPTQ model with weight_zero_point
#17862 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Adding attention mask to vllm.attention.Attention
#17869 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: V1 on AMD MI300A complains that cupy is not present
#17875 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Error when running bge-m3 with vLLM
#17877 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: object has no attribute 'finished_req_ids'
#17881 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Model's test performance degrades significantly when deployed with vLLM at a concurrency exceeding 5
#17886 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Qwen/QwQ-32B crashes with "an illegal memory access was encountered", but SGLang works fine
#17889 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Regarding the CUDA error/no_thinking that occurred during pressure testing
#17893 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Tensor parallelism for GLM-4.5
#22126 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
#18455 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: v0.10.0 built with early version of pytorch that does not support sm-120
#21633 commented on
Aug 7, 2025 • 0 new comments -
[Feature] Skip modules for disabled modalities
#21943 commented on
Aug 7, 2025 • 0 new comments -
[RFC]: KV cache offloading
#19854 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: Potential Integer Overflow permute_cols.cu
#19450 commented on
Aug 7, 2025 • 0 new comments -
[Usage]: Can vllm skip MTP layer loading for GLM-4.5 to save some vram
#22120 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: disaggregated prefilling hangs when TP=2
#11247 commented on
Aug 7, 2025 • 0 new comments -
[Usage]: Triton compilation error (f16 to f16 conversion) on Tesla T4 with Qwen2.5-0.5B-Instruct and LoRA
#20259 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: 'NoneType' object has no attribute 'sampled_token_ids' for DP 2 PP 2
#22062 commented on
Aug 7, 2025 • 0 new comments -
[Usage]: When I use the Qwen3-32B with tool_choice='required' parameter, the tool calling gets stuck in a loop
#21026 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: [0.8.5v1+1P1D+LMCACHE] Is the Prefill instance running queue limited to processing only one request?
#18952 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Add Triton implementation of NVFP4 GEMM
#21014 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Fix MRL Support Detection for Qwen3-Embedding-8B Model (It Supports MRL per Latest Official Docs)
#20899 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Dynamic Chunked Pipeline Parallelism
#20808 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: TTFT increased especially in some Distill Models with small BatchSize in v0.10.0 compared to v0.9.2
#21983 commented on
Aug 8, 2025 • 0 new comments -
[Usage] Qwen3 Usage Guide
#17327 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Empty VllmConfig when calling `get_current_vllm_config`, causing VllmConfig `__post__init__` to fail
#21134 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Support structured output and tool call together
#16313 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vLLM Server Crash with CUDA Memory Error when serving `gemma-3-27b-it-FP8-Dynamic`
#21708 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vllm.third_party.pynvml.NVMLError_InvalidArgument: Invalid Argument
#19071 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: will whisper add language detection?
#14174 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: "transformers not installed" when using --guided-decoding-backend lm-format-enforcer
#14401 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Qwen2 MoE inference is super slow
#15470 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: CPU offload not working for vllm serve
#15877 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Is the V1 Engine ready for DeepSeek-V1/R1?
#16442 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: RuntimeError: operator _C::machete_gemm does not exist
#16810 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: DataParallel on multinode unable to start GPU
#16957 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: raise NotImplementedError
#17086 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: Potential memory leak: VRAM continuously increases and not freed with deepseek-r1 on vLLM v1 engine
#17243 commented on
Aug 8, 2025 • 0 new comments -
[Usage]: CUDA error with Qwen3-32B when processing larger token counts; the model becomes unresponsive / stability concerns
#17534 commented on
Aug 8, 2025 • 0 new comments -
[Feature]: Implement vAttention: Virtual Memory Management for KV Cache on NVIDIA GPUs
#17612 commented on
Aug 8, 2025 • 0 new comments -
[Bug]: vLLM hangs forever waiting for the engine process to start
#17676 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: min_tokens is not respected when stop is triggered early
#21987 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: [V1] Misleading Error Messages
#13510 commented on
Aug 6, 2025 • 0 new comments -
[Frontend] Combine microbatch tokenization with multi-modal processing
#21949 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Refactor tool parsers to eliminate coding errors and allow more efficient implementations.
#11522 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support Inflight quantization: load as 8bit quantization.
#11655 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Can AsyncLLMEngine support batch infer?
#14717 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Design flaws in the current tool parser.
#15177 commented on
Aug 6, 2025 • 0 new comments -
[New Model]: Support for SFR-Embedding-Code-2B_R embedding model
#15362 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: RequestMetrics object (accessed through output[0].metrics) is None
#15394 commented on
Aug 6, 2025 • 0 new comments -
[Performance]: Update Cascade Attention Heuristics for FA3
#15647 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: H20*TP16,can't start service, get error: Cannot allocate memory
#16142 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: vLLM still runs after Ray workers crash
#16259 commented on
Aug 6, 2025 • 0 new comments -
[Feature Request]: Support data_parallel_size in offline inference mode
#16588 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: [v1][Spec Dec] Specifying draft TP does not have any impact.
#17499 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Can't serve a Q4_K_M-GGUF model; can it be served?
#17661 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: Offline multi-node inference
#17711 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support for OpenGVLab/InternVL3-38B-AWQ
#17734 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support quantization for pooling model which does embedding.
#17760 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to Truncate multi-modal tokens
#17765 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Logits processing with Lora is incorrect
#17766 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Support for IBGDA
#17774 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Qwen3 Models GGUF Support
#21511 commented on
Aug 7, 2025 • 0 new comments -
[Feature]: Attention-FFN disaggregation
#21644 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: AWQ fails on MoE models
#22004 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: Failing to initialize engine on qwen3 on B200 with VLLM_USE_DEEP_GEMM=1
#21542 commented on
Aug 7, 2025 • 0 new comments -
invalid conversion from ‘int’ to ‘CUresult’ {aka ‘cudaError_enum’}
#17931 commented on
Aug 7, 2025 • 0 new comments -
[Bug]: LLaMa 3.1 8B/70B/405B all behave poorly and differently using completions API as compared to good chat API
#7382 commented on
Aug 7, 2025 • 0 new comments -
[RFC]: Prompt Embeddings Support in v1 Engine
#22124 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: some error when training in ray
#20431 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: can not support InternVL3-78B-AWQ
#21695 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: GLM-4.1V lora trained model reports target_module mismatch error
#22077 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: OpenReasoning-Nemotron-32B Only Outputs Exclamation Marks Regardless of Input
#21292 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: Cannot use `uv run` or `uv run python` on Mac series
#15985 commented on
Aug 6, 2025 • 0 new comments -
[Feature]: Any plans to support TokenWeave optimizations in vLLM?
#20223 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#20226 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: AttributeError: 'OvisConfig' object has no attribute 'num_attention_heads'
#17646 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: How to implement the inference test of LLM model PD (Prefill-Decode) disaggregation using the vllm framework ?
#21800 commented on
Aug 6, 2025 • 0 new comments -
Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
#2248 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: CUDA version in `vllm/vllm-openai:latest` is older than the k8s node's CUDA 12.9, causing an incompatibility error
#21979 commented on
Aug 6, 2025 • 0 new comments -
[Usage]: DeepSeek R1 on an 8xH200 node is too slow
#17035 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: all2all communication hangs when using DeepEP and PPLX for v0.9.2
#21306 commented on
Aug 6, 2025 • 0 new comments -
[RFC]: Unification of frontend parser
#17817 commented on
Aug 6, 2025 • 0 new comments -
[Bug]: VLLM 0.10.0 breaks quantized models batch inference speed for Qwen2.5-VL-7B (tested multiple quantization types)
#21689 commented on
Aug 6, 2025 • 0 new comments