URL: https://thenewstack.io/netease-fluid-llm-inference/

How NetEase Games cut LLM cold starts from 42 minutes to 30 seconds

NetEase Games cut LLM cold-start times from 42 minutes to 30 seconds with the CNCF Fluid project, enabling serverless GPU inference on Kubernetes.
May 6th, 2026 9:00am by Haifeng Liao and Xiang Zhang
CNCF sponsored this post.

At NetEase Games, we learned a hard lesson about large language model (LLM) inference in production: elastic compute is only useful if data can move just as fast.

On paper, serverless GPU infrastructure looked like a good fit for inference workloads. Game traffic is bursty, peaks differ by title and time of day, and reserving GPU capacity for every possible spike is expensive. But once we started scaling LLM services across regions, a different bottleneck emerged. The real problem was not scheduling containers. It was loading model data.

For 70B-class models, pulling hundreds of gigabytes of weights from remote storage into inference nodes could take tens of minutes. That erased the value of autoscaling. In one representative workload, model load time was reduced from 42 minutes with cross-region direct storage access to 14 minutes with a traditional Alluxio-based cache and then to 3 minutes after we enabled Fluid’s prefetching workflow. That difference turned serverless inference from an architectural idea into something we could actually operate.

The Day 2 problem: Cold starts, shared models, and fragmented GPU capacity

Our AI platform, Tmax, runs on Kubernetes and supports the full ML lifecycle, from notebook-based development to training and inference deployment. As LLM usage increased across game-related scenarios — including intelligent NPCs, content generation, and internal AI services — three operational problems became tightly coupled.

First, GPU resources were scarce and heterogeneous. Different workloads required different card types, memory sizes, and scaling patterns. Keeping enough GPU capacity online for peak demand across every team was inefficient.

Second, inference traffic was not uniform. Some titles peaked in the evening, others during the day. Some workloads were latency-sensitive online inference; others were batch jobs or fine-tuning tasks. Static provisioning drove utilization down and waste up.

Third, serverless cold starts were dominated by model loading. Even when computing resources became available quickly, the model’s data path remained slow. The result was an expensive system that still could not respond to traffic spikes in time.
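
To make the third point concrete, here is a minimal back-of-the-envelope sketch of why model loading dominates a cold start. The 2-bytes-per-parameter weight size, the 30-second scheduling time, and the 0.1 GB/s effective throughput are illustrative assumptions, not measurements from our platform.

```python
def model_size_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    bytes_per_param=2.0 assumes 16-bit weights; this is an assumption,
    not a figure from the article.
    """
    return params_billion * bytes_per_param


def load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to pull the weights at a given effective throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput_gb_per_s must be positive")
    return size_gb / throughput_gb_per_s


def load_share(scheduling_s: float, size_gb: float, throughput_gb_per_s: float) -> float:
    """Fraction of the total cold start that is spent loading model data."""
    load_s = load_seconds(size_gb, throughput_gb_per_s)
    return load_s / (scheduling_s + load_s)


if __name__ == "__main__":
    size = model_size_gb(70)   # 70B-class model, hypothetical 16-bit weights
    scheduling = 30.0          # hypothetical: seconds to schedule and start the container
    throughput = 0.1           # hypothetical: GB/s over a cross-region path
    share = load_share(scheduling, size, throughput)
    print(f"weights: {size:.0f} GB, load share of cold start: {share:.1%}")
```

Under these assumptions, faster container scheduling barely moves the total. Only a faster data path does.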

This is where “Day 2” operations got interesting. The question was no longer how to deploy inference services. It was how to keep model access fast, consistent, and manageable across regions and namespaces over time.

Why we didn’t just run Alluxio directly

What we needed was a Kubernetes-native way to define datasets, prewarm them, mount them into workloads, and share them safely across namespaces. We also needed the runtime layer to scale in step with application behavior.

That higher-level abstraction was the main reason for choosing Fluid, a Cloud Native Computing Foundation (CNCF) incubating project. With Fluid, the operational unit is not just a cache cluster. It’s a dataset and runtime. This configuration maps better to how platform teams actually manage model-serving infrastructure.
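
As a rough illustration of the "dataset and runtime" pairing, the sketch below builds the two manifests as plain Python dictionaries and prints them as JSON, which kubectl accepts. The field names follow our reading of Fluid's `data.fluid.io/v1alpha1` CRDs and should be checked against the installed version. The dataset name, namespace, mount point, and cache sizing are hypothetical, not our production values.

```python
import json

API_VERSION = "data.fluid.io/v1alpha1"


def dataset_manifest(name: str, namespace: str, mount_point: str) -> dict:
    """Describe where the model data lives (the Dataset)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"mounts": [{"mountPoint": mount_point, "name": name}]},
    }


def alluxio_runtime_manifest(name: str, namespace: str, replicas: int, quota: str) -> dict:
    """Describe how the data is cached (the runtime).

    The runtime uses the same name and namespace as the Dataset so that
    Fluid can bind the two together.
    """
    return {
        "apiVersion": API_VERSION,
        "kind": "AlluxioRuntime",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "tieredstore": {
                "levels": [
                    {
                        "mediumtype": "MEM",   # cache in memory
                        "path": "/dev/shm",
                        "quota": quota,        # hypothetical per-worker quota
                        "high": "0.95",
                        "low": "0.7",
                    }
                ]
            },
        },
    }


if __name__ == "__main__":
    # All names and the bucket path below are hypothetical.
    ds = dataset_manifest("llm-70b", "models", "s3://example-bucket/llm-70b")
    rt = alluxio_runtime_manifest("llm-70b", "models", replicas=4, quota="64Gi")
    print(json.dumps([ds, rt], indent=2))
```

Keeping the Dataset separate from the runtime is what would let a platform team swap the cache implementation later without changing how workloads refer to the model.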

[Figure: Infographic showing the storage perspective of K8s and the data usage perspective of Fluid]

Fluid: Adding operational control to Alluxio

| Dimension | Challenges With Running Alluxio Directly | What Fluid Added |
| --- | --- | --- |
| Integration with Kubernetes | Alluxio master and worker clusters had to be deployed and managed separately, with limited alignment to Kubernetes-native lifecycle and scheduling behavior. | Fluid automated runtime deployment and lifecycle management, supported cache elasticity through mechanisms such as HPA/KEDA, and made it easier to align compute placement with cached data through data-aware scheduling. |
| LLM inference-specific optimization | General-purpose caching improved access times, but loading large models still required custom warmup logic and additional operational work. | Fluid provided prefetch workflows for scheduled, event-driven, and proactive warm-up. It also lets us optimize for framework-specific access behavior, including vLLM and SGLang-style model-loading patterns, and scale the cache down again after deployment when appropriate. |
| Data abstraction and runtime decoupling | A direct deployment model tied operations more closely to a single cache implementation, making long-term evolution harder. | Fluid separated the dataset abstraction from the runtime layer. That allowed us to maintain a stable operational model while retaining the option to switch runtimes over time, such as Alluxio, JindoCache, or JuiceFS. |
| Isolation and sharing across teams | Multi-team sharing required more manual namespace, quota, and configuration design, especially when common base models had to be reused safely. | Fluid supported dataset-level logical isolation and cross-namespace sharing, with access control aligned to native Kubernetes mechanisms. |
| Support for heterogeneous compute environments | Deploying and managing the same data access model across environments, such as serverless containers, was more difficult and usually required additional integration work. | Fluid supported both CSI- and Sidecar-based access patterns. Webhook-based Sidecar injection reduced the amount of application-side change needed to use the same model-loading path across environments. |

Fluid also made a few common patterns easier for us:

  • Prefetch before startup so inference Pods do not pay the full cold-start penalty at runtime.
  • Schedule scale-up and warm-up for workloads with predictable traffic windows.
  • Share common base models across namespaces so that each team does not have to cache them repeatedly.
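
The first two patterns can be sketched as a scheduled prefetch: work out when warm-up has to begin so the cache is hot before a known traffic window, then express that as a recurring data-load job. This is a sketch under assumptions, not our production configuration. The `DataLoad` fields (`dataset`, `loadMetadata`, `target`, `policy`, `schedule`) follow our reading of Fluid's v1alpha1 API and should be verified against the installed version. The names, the 19:00 peak, and the safety margin are hypothetical, and the cron expression is evaluated in whatever time zone the controller uses.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def warmup_start(peak: time, load_minutes: float, margin_minutes: float = 10.0) -> time:
    """Wall-clock time at which a daily prefetch should begin."""
    anchor = datetime.combine(datetime(2000, 1, 2).date(), peak)  # arbitrary date
    return (anchor - timedelta(minutes=load_minutes + margin_minutes)).time()


def daily_cron(start: time) -> str:
    """Five-field cron expression that fires once a day at `start`."""
    return f"{start.minute} {start.hour} * * *"


def dataload_manifest(name: str, dataset: str, namespace: str, path: str = "/",
                      replicas: int = 1, cron: Optional[str] = None) -> dict:
    """Build a prefetch job for a dataset; one-off unless `cron` is given."""
    spec = {
        "dataset": {"name": dataset, "namespace": namespace},
        "loadMetadata": True,
        "target": [{"path": path, "replicas": replicas}],
    }
    if cron is not None:
        spec["policy"] = "Cron"   # assumed field name, see the note above
        spec["schedule"] = cron
    return {
        "apiVersion": "data.fluid.io/v1alpha1",
        "kind": "DataLoad",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    # Hypothetical evening peak at 19:00, 3-minute load, 12-minute margin.
    start = warmup_start(time(19, 0), load_minutes=3, margin_minutes=12)
    job = dataload_manifest("warm-llm-70b", "llm-70b", "models", cron=daily_cron(start))
    print(start, job["spec"]["schedule"])
```

The margin is there because a prefetch that finishes just after traffic arrives still leaves the first Pods paying the cold-start cost.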

The last point mattered more than we expected. In a multi-tenant platform, repeated caching of the same model wastes memory and creates version-management overhead. Fluid lets us maintain shared models in a single namespace and expose them to application teams via references rather than duplicate runtime stacks.

What changed in production

The result was not a small tuning improvement. It changed whether elastic inference was practical for us.

In an earlier benchmark path, model load time dropped from 42 minutes with cross-region direct access to 14 minutes with a conventional cache layer, and then to 3 minutes after enabling Fluid-based prefetching. After further tuning in production, the startup time for two model inference services was reduced to about one minute and, in some cases, even under 30 seconds.

[Figure: Model load time comparison between cross-region access, Alluxio cache, and Fluid prefetch]
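
For readers who want the ratios behind these figures, the short sketch below computes the speedup of each stage against the 42-minute baseline. The load times come from the text above. Treating "about one minute" as 1.0 and "under 30 seconds" as an upper bound of 0.5 minutes is our simplification for the arithmetic.

```python
# Load times in minutes, taken from the figures quoted in the article.
LOAD_TIMES_MIN = {
    "cross-region direct access": 42.0,
    "conventional cache layer": 14.0,
    "Fluid-based prefetching": 3.0,
    "production, typical": 1.0,     # "about one minute"
    "production, best case": 0.5,   # "under 30 seconds", used as an upper bound
}


def speedup(baseline_min: float, improved_min: float) -> float:
    """How many times faster the improved path is than the baseline."""
    if improved_min <= 0:
        raise ValueError("improved_min must be positive")
    return baseline_min / improved_min


def speedups_vs(times: dict[str, float], baseline: str) -> dict[str, float]:
    """Speedup of every entry relative to the named baseline entry."""
    base = times[baseline]
    return {name: speedup(base, minutes) for name, minutes in times.items()}


if __name__ == "__main__":
    factors = speedups_vs(LOAD_TIMES_MIN, "cross-region direct access")
    for name, factor in factors.items():
        print(f"{name}: {LOAD_TIMES_MIN[name]:g} min, {factor:g}x vs. baseline")
```

Because the best-case entry is an upper bound on load time, its ratio is a lower bound on the real speedup.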

The significant reduction in startup latency led to a corresponding reduction in cost, because we could scale GPU resources down more aggressively during quiet periods and still bring them back in time for the next spike.

The cache-sharing model also reduced waste. Instead of caching the same foundation model separately for each namespace, we could warm it once and let multiple services consume it. That lowered cache memory overhead and simplified operations for platform teams.

Just as important, the distributed cache helped absorb startup bursts. When many inference Pods were launched together, the platform no longer pushed all of that pressure directly onto the backend storage path.
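
The two effects just described, a smaller cache footprint and less pressure on backend storage, can be approximated with a simple model. It assumes the cache is warmed with a single full read of the model and that a shared dataset is stored once, regardless of how many namespaces consume it. The Pod count, namespace count, and model size in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Burst:
    pods: int          # inference Pods started together
    namespaces: int    # teams that use the same base model
    model_gb: float    # size of the model weights in GB


def backend_read_gb(burst: Burst, cached: bool) -> float:
    """Data pulled from backend storage during the burst.

    Without a cache every Pod reads the full model from the backend.
    With a warmed cache the backend is read once, during warm-up.
    """
    return burst.model_gb if cached else burst.model_gb * burst.pods


def cache_footprint_gb(burst: Burst, shared: bool) -> float:
    """Cache space used for the model across all namespaces."""
    copies = 1 if shared else burst.namespaces
    return burst.model_gb * copies


if __name__ == "__main__":
    burst = Burst(pods=20, namespaces=5, model_gb=140.0)  # hypothetical values
    print("backend reads (GB):", backend_read_gb(burst, cached=False),
          "->", backend_read_gb(burst, cached=True))
    print("cache footprint (GB):", cache_footprint_gb(burst, shared=False),
          "->", cache_footprint_gb(burst, shared=True))
```

The model ignores cache replication and partial reads, so it indicates the direction and rough scale of the savings, not exact figures.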

A useful way to frame the choice

For us, the comparison was not really “Fluid versus Alluxio” as competing products. It was a choice between solving a narrow problem and solving the operational one.

If the requirement is simply to put a cache in front of remote storage, running Alluxio directly may be enough. If the requirement is to operate LLM inference on Kubernetes over time — with prefetching, sharing, autoscaling, and multi-tenant controls — then the higher-level data orchestration model matters.

That was the difference in our case. The issue was never just where the model files lived. The challenge was making them available quickly, predictably, and affordably for production inference.

The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure including Kubernetes, OpenTelemetry, and Argo. CNCF is the neutral home for cloud native collaboration, bringing together the industry’s top developers, end users, and vendors.
Haifeng Liao is a Senior Infrastructure Engineer at NetEase Games, where he works on AI infrastructure and compute platform reliability for large-scale game AI workloads.
Xiang Zhang is Head of AI Infrastructure at NetEase Games, where he leads the evolution and architecture of the company’s AI infrastructure platform, with a focus on performance, availability, and cost efficiency.