Performance
Showing new listings for Friday, 26 June 2026
- arXiv:2606.26765
Title: On-Demand Service Zone Design for Energy-Constrained Spatial Queueing Systems
Subjects: Performance (cs.PF); Optimization and Control (math.OC)
Electric service vehicles (ESVs), such as mobile chargers and drone-based service units, are becoming an important operational resource for on-demand service systems. Unlike conventional spatial servers, ESV operations are shaped by battery limits and recharging needs, which affect dispatch feasibility and spatial deployment decisions. We develop an energy-constrained hypercube spatial queueing model that embeds battery-state dynamics into the classical hypercube framework and uses a semi-Markov representation to estimate steady-state performance. We then formulate a joint location-zoning problem for station placement and service zone design. The resulting large-scale mixed-integer nonlinear program admits a set partitioning reformulation whose column coefficients are not available in closed form. We therefore develop a Branch-Price-and-Evaluation framework for set partitioning problems with externally computable column coefficients: upper-bounding surrogates guide pricing, and iterative exact evaluation updates the coefficients of active columns. Computational results show that explicit energy modeling significantly reduces false service promises and yields more credible planning decisions. They also reveal a load-dependent reversal in zoning: pooling is preferable under light demand, whereas tighter zoning becomes more profitable as demand increases. Over the tested range, profitability is driven more by zoning than by battery improvement, suggesting that managers should get service zone design right before investing in battery upgrades; this caution is reinforced by the counterintuitive finding that larger batteries may delay replenishment and reduce fleet readiness under sparse demand. These findings show that energy feasibility is not merely a matter of battery-capacity expansion, but a design dimension that shapes service-zone configuration.
New submissions (showing 1 of 1 entries)
- arXiv:2606.26344 (cross-list from cs.PL)
Title: Axon: A Synthesizing Superoptimizer for Tensor Programs
Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
Writing high-performance kernels for AI accelerators requires deep expertise in tiling, instruction selection, data layout, and operator fusion, placing a significant burden on programmers. In this paper, we focus on tile-based AI accelerator programs and present Axon, a synthesizing superoptimizer for tensor programs: it uses program synthesis to automatically generate target instructions from semantics specifications, and explores semantically equivalent program variants to select the best-performing kernel empirically. Axon discovers algebraic transformations by propagating operators through computation graphs and uses SMT over unbounded tensors to guarantee that all transformations preserve semantics without requiring hand-crafted rewrite rules. It then lowers tensor operations to target ISA instructions, explores tiling configurations constrained by hardware descriptions, and fuses operators and instructions to minimize memory traffic.
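As a rough illustration of what "propagating an operator through a computation graph" can mean, the sketch below pushes a transpose through a matrix product, rewriting transpose(A·B) as transpose(B)·transpose(A), and compares the two variants on random integer inputs. Randomized testing is a stand-in chosen here for simplicity; the abstract states that Axon uses SMT over unbounded tensors, which gives a guarantee that sampling cannot. All function names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def original(a, b):
    """Program variant 1: transpose applied after the product."""
    return transpose(matmul(a, b))


def rewritten(a, b):
    """Program variant 2: transpose propagated onto the operands."""
    return matmul(transpose(b), transpose(a))


def random_matrix(rows, cols, rng):
    # Small integers keep the comparison exact (no floating-point rounding).
    return [[rng.randint(-5, 5) for _ in range(cols)] for _ in range(rows)]


def equivalent_on_samples(f, g, trials=50, seed=0):
    """Check f and g on random shapes and values; evidence, not a proof."""
    rng = random.Random(seed)
    for _ in range(trials):
        m, k, n = (rng.randint(1, 4) for _ in range(3))
        a = random_matrix(m, k, rng)
        b = random_matrix(k, n, rng)
        if f(a, b) != g(a, b):
            return False
    return True


print(equivalent_on_samples(original, rewritten))
```

A superoptimizer would then time both variants on the target and keep the faster one; that selection step is omitted here.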
- arXiv:2606.26383 (cross-list from cs.LG)
Title: SOLAR: AI-Powered Speed-of-Light Performance Analysis
Authors: Qijing Huang, Sana Damani, Zhifan Ye, Athinagoras Skiadopoulos, Siva Kumar Sastry Hari, Jason Clemons, Sahil Modi, Jingquan Wang, Aditya Kane, Edward C Lin, Humphrey Shi, Christos Kozyrakis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Multiagent Systems (cs.MA); Performance (cs.PF)
How fast could a deep-learning model run on target hardware, and how far is today's implementation from that limit? These questions are central to software, hardware, and algorithm optimizations. Speed-of-Light (SOL) analysis answers them by computing a workload's theoretical minimum execution time on a given architecture. Yet deriving SOL bounds remains manual, error-prone, and disconnected from rapid model development. To close this gap, we introduce SOLAR, a framework that automatically derives validated SOL bounds from PyTorch and JAX source code. SOLAR leverages both generative and deterministic components in its flow: an LLM frontend translates any source program into an executable Affine Loop IR, validated by output comparison; a deterministic flow lifts the IR into an einsum graph; and an analytical backend computes unfused, fused, and cache-aware SOL bounds. SOLAR provides comprehensive operator and language coverage, produces validated bounds with zero observed SOL violations, and offers multi-fidelity analysis that tightens bounds and surfaces optimization insights. We evaluate SOLAR across KernelBench, JAX/Flax models, and robotics workloads. These experiments demonstrate four use cases: headroom analysis at multiple fidelity levels, identifying optimization opportunities, cross-platform exploration, and inverse-roofline hardware provisioning.
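To make the idea of an SOL bound concrete, here is a minimal roofline-style sketch for a matrix multiply followed by an elementwise operation. It is not SOLAR's analytical backend: the cost formulas, the `Hardware` class, and the peak rates are simplifying assumptions introduced for illustration. The unfused bound charges each operator for its own memory traffic; the fused bound assumes the intermediate result never leaves the chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Peak rates of a hypothetical accelerator (illustrative numbers only)."""
    peak_flops: float      # floating-point operations per second
    peak_bandwidth: float  # bytes per second to and from main memory


def sol_time(flops, bytes_moved, hw):
    """Lower bound on runtime: the slower of compute and memory traffic."""
    return max(flops / hw.peak_flops, bytes_moved / hw.peak_bandwidth)


def matmul_cost(m, n, k, dtype_bytes=2):
    """FLOPs and minimum bytes for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k
    bytes_moved = dtype_bytes * (m * k + k * n + m * n)
    return flops, bytes_moved


def elementwise_cost(m, n, dtype_bytes=2):
    """FLOPs and bytes for one unary op: read the input, write the output."""
    return m * n, dtype_bytes * 2 * m * n


def unfused_bound(m, n, k, hw):
    """Each operator runs separately and pays for its own memory traffic."""
    return sol_time(*matmul_cost(m, n, k), hw) + sol_time(*elementwise_cost(m, n), hw)


def fused_bound(m, n, k, hw):
    """One kernel: the intermediate matrix is never written to main memory."""
    mm_flops, mm_bytes = matmul_cost(m, n, k)
    ew_flops, _ = elementwise_cost(m, n)
    return sol_time(mm_flops + ew_flops, mm_bytes, hw)


hw = Hardware(peak_flops=1e15, peak_bandwidth=3e12)  # assumed values
unfused = unfused_bound(4096, 4096, 4096, hw)
fused = fused_bound(4096, 4096, 4096, hw)
print(f"unfused: {unfused * 1e6:.1f} us, fused: {fused * 1e6:.1f} us")
```

Under these assumptions the fused bound can never exceed the unfused one, and the gap between them indicates how much time operator fusion could recover at best.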
- arXiv:2606.26439 (cross-list from cs.IR)
Title: TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization
Subjects: Information Retrieval (cs.IR); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs and identify a severe bandwidth gap: naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded. We present TileMaxSim, a family of IO-aware Triton kernels that close this gap via (1) multi-query SRAM tiling that streams document embeddings through shared memory while accumulating per-query-token maxima in registers, reading each embedding from HBM exactly once; (2) dimension tiling that partitions the embedding dimension into 128-wide chunks, enabling scoring for d > 128 embeddings that overflow shared memory; and (3) fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring, 6.5x over fused PyTorch, 6.6-8.5x over this http URL, and 469x the scoring throughput of WARP's CPU engine on the same node. TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim. As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency). We further show constant throughput from 100K to 500K documents, data-parallel multi-GPU sharding, robustness across dimensions 64-768, and FP16/BF16/FP32 support. Concurrent work independently develops an IO-aware fused MaxSim kernel; we differ in dimension tiling for d > 128 and fused product-quantization scoring.
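The contrast between materializing the Nq x Nd similarity matrix and streaming document embeddings in tiles can be sketched on the CPU in plain Python. This is not the Triton kernel described in the abstract and it omits dimension tiling and product quantization; the function names and the `tile` parameter are hypothetical. It only shows that keeping one running maximum per query token yields the same MaxSim score without storing the full matrix.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(queries, docs):
    """Build the full Nq x Nd similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in docs] for q in queries]
    return sum(max(row) for row in sim)


def maxsim_streaming(queries, docs, tile=4):
    """Visit document tokens tile by tile, keeping one running max per query token."""
    running_max = [float("-inf")] * len(queries)
    for start in range(0, len(docs), tile):
        block = docs[start:start + tile]  # each document token is read once
        for i, q in enumerate(queries):
            for d in block:
                score = dot(q, d)
                if score > running_max[i]:
                    running_max[i] = score
    return sum(running_max)


rng = random.Random(0)
Q = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]    # 3 query tokens
D = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]   # 10 document tokens
print(maxsim_naive(Q, D), maxsim_streaming(Q, D))
```

The streaming version needs only Nq accumulators of extra state regardless of document length, which is the property that lets a GPU kernel keep the maxima in registers.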
- arXiv:2606.26547 (cross-list from cs.PL)
Title: Compiler-Driven Approximation Tuning for Hyperdimensional Computing
Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
As Moore's law reaches its physical and economic limits, domain-specific approaches are increasingly employed to accelerate machine learning workloads. Hyperdimensional Computing (HDC) represents one such emerging paradigm, offering an alternative to conventional deep learning techniques. Rooted in cognitive models of computation, HDC is designed bottom-up with hardware efficiency as a first-class objective. HDC workloads map naturally to heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs, as well as emerging in-memory computing technologies such as Resistive RAM (ReRAM) and Phase-Change Memory (PCM). HDC algorithms are intrinsically tolerant to noise and approximation, enabling substantial performance gains with minimal accuracy loss. In this work, we introduce ApproxHDC, a framework for automated identification and application of domain-specific approximations in HDC workloads. ApproxHDC extends the HPVM-HDC compiler infrastructure to enable retargetable compilation across diverse hardware backends, including CPUs, GPUs, and simulated ReRAM and PCM-based accelerators. The space of possible approximations is exponentially large; ApproxHDC employs efficient search and analysis to navigate it and identify high-impact configurations spanning both software and hardware levels.
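The claim that HDC tolerates noise and approximation can be illustrated with a small bipolar-hypervector sketch. The approximation used here, scoring similarity on only a prefix of the dimensions, is an assumption chosen for illustration; the abstract does not say which approximations ApproxHDC applies. All names, the dimensionality, and the noise level are hypothetical.

```python
import random


def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Elementwise multiplication; binding with the same vector twice undoes it."""
    return [x * y for x, y in zip(a, b)]


def flip_noise(hv, fraction, rng):
    """Flip the sign of a fixed fraction of randomly chosen components."""
    noisy = list(hv)
    for i in rng.sample(range(len(hv)), int(fraction * len(hv))):
        noisy[i] = -noisy[i]
    return noisy


def similarity(a, b, dims=None):
    """Dot product over the first `dims` components (all of them if None)."""
    n = len(a) if dims is None else dims
    return sum(a[i] * b[i] for i in range(n))


def classify(query, prototypes, dims=None):
    """Index of the prototype most similar to the query."""
    scores = [similarity(query, p, dims) for p in prototypes]
    return scores.index(max(scores))


rng = random.Random(0)
DIM = 2048
prototypes = [random_hv(DIM, rng) for _ in range(2)]
query = flip_noise(prototypes[0], 0.2, rng)  # class 0 with 20% of components flipped

print(classify(query, prototypes))             # exact: all 2048 dimensions
print(classify(query, prototypes, dims=256))   # approximate: first 256 dimensions
```

Because unrelated random hypervectors are nearly orthogonal, the correct prototype usually keeps a wide similarity margin even when one eighth of the dimensions are scored, which is the kind of slack an approximation tuner can trade for speed.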
