The Data-Aware Pipeline
Distributed training frameworks for large language models parallelize computation across GPUs using strategies like data parallelism, tensor parallelism, and pipeline parallelism. These strategies are fundamentally data-blind: they partition the model and schedule computation based on the model's architecture, not on the data passing through it. A batch of short text sequences and a batch of high-resolution images are parallelized the same way, despite having radically different compute and memory profiles.
For text-only models, data-blindness is tolerable: token sequences have relatively uniform computational cost. For multimodal models that process text, images, audio, and video, data-blindness is expensive. An image token costs more than a text token (it is produced by a visual encoder and participates in spatial attention), and a video frame costs more than a single image (temporal attention couples frames). This variance in per-sample cost means that data-parallel workers finish at different times, and the slowest worker gates the entire step.
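To make the straggler effect concrete, here is a toy calculation. The per-sample costs are hypothetical, chosen only to illustrate that under synchronous data parallelism the step time is set by the maximum worker load, not the mean:

```python
# Hypothetical per-sample costs (arbitrary units) for a mixed batch:
# four text sequences, two high-resolution images, two video clips.
costs = [1, 1, 1, 1, 8, 8, 12, 12]

# Naive contiguous split across 4 data-parallel workers: worker i
# gets two consecutive samples, regardless of cost.
workers = [costs[i:i + 2] for i in range(0, len(costs), 2)]
loads = [sum(w) for w in workers]

print(loads)                     # [2, 2, 16, 24]
print(max(loads))                # step cost: 24 (slowest worker gates)
print(sum(loads) / len(loads))   # ideal balanced cost: 11.0
```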
DFLOP — Data-driven Framework for multimodal LLM training pipeline Optimization — profiles the actual cost of each data modality and uses this profile to balance workloads. It sorts training samples by estimated cost, then assigns them to workers to equalize per-worker computation. High-cost samples (video, high-resolution images) are spread across workers rather than concentrated. The sorting and assignment happen at the data loader, requiring no changes to the training framework itself.
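The text does not spell out the assignment algorithm, but the description (sort by estimated cost, spread heavy samples rather than concentrating them) matches a classic longest-processing-time greedy pack. A minimal sketch, assuming a `cost_of` callable supplied by the profiler; the function and parameter names are illustrative, not DFLOP's API:

```python
import heapq

def balance(samples, cost_of, num_workers):
    """Greedy longest-processing-time packing: sort samples by estimated
    cost (descending) and always hand the next sample to the currently
    least-loaded worker. Runs entirely in the data loader."""
    # Min-heap of (current_load, worker_id) so the lightest worker
    # is always at the top.
    heap = [(0.0, w) for w in range(num_workers)]
    assignment = [[] for _ in range(num_workers)]
    for sample in sorted(samples, key=cost_of, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(sample)
        heapq.heappush(heap, (load + cost_of(sample), w))
    return assignment
```

With the toy costs above, `balance(costs, lambda c: c, 4)` yields worker loads of [12, 12, 10, 10]: the step cost falls from 24 to 12, close to the ideal of 11.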
The optimization is continuous: as training progresses and the model changes (different layers become bottlenecks), DFLOP re-profiles and re-balances. Profiling overhead is small relative to the cost of a training step because the cost model is empirical: measured timing rather than analytical prediction.
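As a sketch of what such an empirical cost model could look like (the class, its signature bucketing, and the moving-average update are assumptions for illustration, not DFLOP's documented design):

```python
class EmpiricalCostModel:
    """Empirical (measured, not analytical) cost model: keep an
    exponential moving average of observed per-sample time, keyed by a
    coarse signature such as (modality, token-count bucket).
    Re-profiling is simply continuing to update the averages as the
    model evolves."""

    def __init__(self, alpha=0.1, default=1.0):
        self.alpha = alpha      # EMA smoothing factor
        self.default = default  # prior cost for unseen signatures
        self.avg = {}

    def signature(self, sample):
        # Bucket token counts so the table stays small; the bucket
        # width of 256 is an arbitrary illustrative choice.
        return (sample["modality"], sample["tokens"] // 256)

    def estimate(self, sample):
        return self.avg.get(self.signature(sample), self.default)

    def observe(self, sample, measured_seconds):
        key = self.signature(sample)
        old = self.avg.get(key, measured_seconds)
        self.avg[key] = (1 - self.alpha) * old + self.alpha * measured_seconds
```

A loader would use `estimate` as the `cost_of` function for the packing sketch above and feed wall-clock timings back through `observe` after each step; because costs are measured rather than derived analytically, the table tracks shifts in where the model spends time without a separate prediction model.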
The through-claim: distributed training frameworks optimize the wrong thing. They optimize model partitioning, which is static, while the actual bottleneck is data heterogeneity, which is dynamic. Making the data loader aware of data cost is a smaller change with a larger impact than any model-partitioning optimization, because the variance is in the data, not the model.