URL: https://huggingface.co/papers/2603.26164

arxiv:2603.26164

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Published on Mar 27
· Submitted by Bohan Zeng on Apr 3
#1 Paper of the day
· Peking University

Abstract

DataFlex is a unified framework for dynamic data-centric training of large language models that supports sample selection, domain mixture adjustment, and sample reweighting while maintaining compatibility with standard training workflows and enabling efficient large-scale deployment.

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
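The three paradigms named in the abstract can be illustrated with a toy sketch. This is not DataFlex's API: the function names (`select_top_k`, `reweight`, `update_mixture`), the softmax weighting, and the multiplicative mixture update are all assumptions chosen for illustration. In particular, the mixture update is a simplified exponentiated-weights step on raw domain losses, not the DoReMi or ODM algorithm.

```python
import math

def select_top_k(scores, k):
    """Sample selection: return indices of the k highest-scoring samples.
    How the scores are computed (e.g. influence, loss) is left open here."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def reweight(losses, temperature=1.0):
    """Sample reweighting: softmax over per-sample losses, so that
    higher-loss samples receive larger weights. Weights sum to 1."""
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - m) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]

def update_mixture(weights, domain_losses, lr=1.0):
    """Domain mixture adjustment: multiply each domain weight by
    exp(lr * loss) and renormalize, shifting mass toward harder domains."""
    raw = {d: w * math.exp(lr * domain_losses[d]) for d, w in weights.items()}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}

# Hypothetical values, for illustration only.
selected = select_top_k([0.1, 0.9, 0.4, 0.7], k=2)  # -> [1, 3]
sample_weights = reweight([2.0, 1.0, 0.5])
mixture = update_mixture({"web": 0.5, "code": 0.5}, {"web": 1.0, "code": 2.0})
```

In a dynamic setting, each of these would be recomputed periodically during training from the current model's outputs, rather than once before training starts.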

Community

Paper author Paper submitter

DataFlex is a data-centric training framework that enhances model performance by selecting the most influential samples, optimizing sample weights, or adjusting domain mixing ratios.

DataFlex is a unified data-centric dynamic training framework for large models, jointly developed by the PKU DCAI lab and the LLaMA-Factory team. It supports three core capabilities in one place: data selection, data mixing, and sample reweighting. It is fully compatible with the native training workflow and also supports large-scale training with DeepSpeed ZeRO-3. It can substantially improve experimental reproducibility and model performance, and it is practical for both research and real-world development. Feel free to reach out and discuss!

Found a good walkthrough of this at https://arxivexplained.com/p/dataflex-a-unified-framework-for-data-centric-dynamic-training-of-large-language-models. The data-centric angle is underrated imo. Most people focus on architecture changes, but what's in the training data, and how it shifts over time, matters just as much.

Get this paper in your agent:

hf papers read 2603.26164
