Voozh

3 min read

👁 creeta profile

Creeta

Jun 18

Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out

#qwen3 #nvfp4 #vllm #nvidia

8 min read

👁 gaearuiw profile

GaeaRuiW

Jun 9

I built an open-source alternative to Microsoft's KAITO that works on ANY Kubernetes cluster

#kubernetes #vllm #devops #opensource

2 min read

👁 tech_nuggets profile

Tech_Nuggets

Jun 7

Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%

#llm #ai #infrastructure #vllm

9 min read

👁 tech_nuggets profile

Tech_Nuggets

Jun 6

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

#llm #ai #vllm #performance

1 reaction

8 min read

👁 ric03uec profile

Devashish

Jun 16

Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Coding

#localllm #vllm #ai #nvidia

5 min read

👁 Google Developer Experts logo
👁 xbill profile

xbill

for Google Developer Experts

May 30

Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run

#googleantigravity #vllm #googlecloudrun #gemma4

4 reactions

14 min read

👁 o96a profile

Aamer Mihaysi

May 8

vLLM's V1 Release Fixes the Silent Killer in RL Training

#vllm #machinelearning #python

2 min read

👁 glad_labs profile

Matthew Gladding

Apr 24

The 70B Threshold: How the RTX 5090 Rewrites the Home Lab Equation

#model #memory #models #vllm

8 min read

👁 sergeyshmakov profile

Sergey Shmakov

May 26

How RunPod FlashBoot Actually Works (4-Request Test)

#runpod #flashboot #serverless #vllm

1 reaction

10 min read

👁 1grace profile

Grace

May 21

Rethinking Open Source Contribution in the Age of AI Agents, featuring vLLM Core Maintainer Roger Wang at MLSys'26

#vllm #ai #machinelearning #llm

8 reactions

6 comments

3 min read

👁 thurmon_demich profile

Thurmon Demich

May 20

Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?

#ollama #llamacpp #vllm #comparison

1 comment

5 min read

👁 manikandan_t_6d72e32ac4e8 profile

Manikandan T

May 13

72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X

#vllm #rocm #mi300x #genai

13 min read

👁 albertocodes profile

Alberto Nieto

Apr 1

From one model to seven — what it took to make TurboQuant model-portable

#python #vllm #gpu #triton

3 min read

👁 albertocodes profile

Alberto Nieto

Mar 28

Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1

#python #vllm #gpu #containers

1 reaction

2 min read

URL: https://dev.to/t/vllm

⇱ Vllm - DEV Community

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
