URL: https://dev.to/t/vllm
Vllm - DEV Community
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
The Cyber Sidekick
Jun 18
#edgeai #kubernetes #llminference #vllm
3 min read
Qwen3.6-35B NVFP4 runs on one H100 — A100 owners are out
Creeta
Jun 18
#qwen3 #nvfp4 #vllm #nvidia
8 min read
I built an open-source alternative to Microsoft's KAITO that works on ANY Kubernetes cluster
GaeaRuiW
Jun 9
#kubernetes #vllm #devops #opensource
2 min read
Prefix caching at scale: when it saves you 80% of prefill cost, and the eviction policies that quietly turn it into 5%
Tech_Nuggets
Jun 7
#llm #ai #infrastructure #vllm
9 min read
KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break
Tech_Nuggets
Jun 6
#llm #ai #vllm #performance
1 reaction
8 min read
Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Coding
Devashish
Jun 16
#localllm #vllm #ai #nvidia
5 min read
Gemma 4 Benchmarking NVIDIA Blackwell RTX 6000 vs L4 on Google Cloud Run
xbill for Google Developer Experts
May 30
#googleantigravity #vllm #googlecloudrun #gemma4
4 reactions
14 min read
vLLM's V1 Release Fixes the Silent Killer in RL Training
Aamer Mihaysi
May 8
#vllm #machinelearning #python
2 min read
The 70B Threshold: How the RTX 5090 Rewrites the Home Lab Equation
Matthew Gladding
Apr 24
#model #memory #models #vllm
8 min read
How RunPod FlashBoot Actually Works (4-Request Test)
Sergey Shmakov
May 26
#runpod #flashboot #serverless #vllm
1 reaction
10 min read
Rethinking Open Source Contribution in the Age of AI Agents, featuring vLLM Core Maintainer Roger Wang at MLSys'26
Grace
May 21
#vllm #ai #machinelearning #llm
8 reactions
6 comments
3 min read
Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?
Thurmon Demich
May 20
#ollama #llamacpp #vllm #comparison
1 comment
5 min read
72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X
Manikandan T
May 13
#vllm #rocm #mi300x #genai
13 min read
From one model to seven — what it took to make TurboQuant model-portable
Alberto Nieto
Apr 1
#python #vllm #gpu #triton
3 min read
Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1
Alberto Nieto
Mar 28
#python #vllm #gpu #containers
1 reaction
2 min read