KTransformers Adds AVX2 MoE Support For Viable Performance On CPUs Without AMX/AVX-512
KTransformers 0.5.3 was released today as the newest version of this framework for efficient inferencing and fine-tuning of large language models (LLMs) with a focus on CPU-GPU heterogeneous computing. With this release, KTransformers becomes more applicable to CPUs lacking Advanced Matrix Extensions (AMX) and AVX-512, as it now provides some AVX2-only kernels too.
KTransformers 0.5.3 introduces AVX2-only inference support for Mixture of Experts (MoE) models, covering BF16, FP8, and GPTQ-INT4 MoE workloads. This is very beneficial for current and recent generation Intel Core (Ultra) processors, which lack AVX-512, unlike the latest Xeon servers with AMX and AVX-512 or AMD Zen 4/5 CPUs that also have AVX-512. That said, a CPU with AVX-512 or AMX will still yield much greater CPU-based AI inferencing performance.
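For those unsure which of these instruction set extensions their own CPU exposes, below is a minimal Python sketch for Linux x86 systems that reads the feature flags from /proc/cpuinfo. The helper names `parse_cpu_flags` and `best_simd_tier` and the tier labels are hypothetical and purely illustrative; this is not how KTransformers itself selects kernels, only a quick way to see whether a system falls into the AVX2-only category discussed above.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line
    of /proc/cpuinfo-style text (empty set if no such line exists)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def best_simd_tier(flags):
    """Map a set of Linux CPU flags to an illustrative tier label.
    The labels are assumptions for this sketch, not KTransformers names."""
    if "amx_tile" in flags:   # AMX tile support, as reported by Linux
        return "amx"
    if "avx512f" in flags:    # AVX-512 Foundation
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "none"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Not Linux, or /proc is unavailable: treat as no known flags.
        text = ""
    print(best_simd_tier(parse_cpu_flags(text)))
```

A system reporting `avx2` here is the kind of target the new AVX2-only MoE kernels are aimed at, while `avx512` or `amx` systems can use the faster existing paths.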
The AVX2 inference support was recently introduced to kt-kernel via a pull request, and new documentation outlines running KTransformers on AVX2 processors for those interested.
KTransformers 0.5.3 also brings NUMA-aware deployment improvements for finer-grained NUMA mapping in multi-socket environments, lower idle CPU overhead, speculative decode enhancements, and various other improvements.
Those interested can find KTransformers 0.5.3 downloads and all the release details over on GitHub.
