CUR-MoE: Portable Mixture-of-Experts with Interpretable High-Ratio Compression
Description
Mixture-of-experts (MoE) architectures enable trillion-parameter models but face prohibitive memory scaling, limited compression interpretability, and vendor-specific implementations that hinder heterogeneous HPC deployment.
We present CUR-MoE, the first Julia-based MoE framework to apply CUR matrix factorization to expert compression, combining interpretable compression with a hardware-agnostic design. While SVD-based methods provide effective compression, CUR-MoE offers comparable performance with enhanced interpretability, since CUR preserves actual columns and rows of the expert weight matrices, and it remains viable at high compression ratios (35.29 perplexity at 70% compression). A comprehensive gating evaluation shows that ExpertChoice routing achieves the best load balancing. Julia's LLVM-based compilation delivers a consistent 5-6× GPU acceleration across NVIDIA, AMD, Intel, and Apple hardware.
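The core idea, CUR decomposition, can be sketched briefly. Unlike SVD, which mixes all columns into abstract singular vectors, CUR approximates a weight matrix W as C·U·R where C and R are actual columns and rows sampled from W itself, so the retained indices are directly interpretable. A minimal NumPy illustration (the function name `cur_compress` and the norm-based sampling scheme are illustrative assumptions, not CUR-MoE's actual API or sampling strategy):

```python
import numpy as np

def cur_compress(W, k, rng=np.random.default_rng(0)):
    """Approximate W ~= C @ U @ R by sampling k actual columns and rows of W."""
    # Sampling probabilities from squared column/row norms (a cheap
    # stand-in for leverage scores).
    col_p = (W**2).sum(axis=0); col_p /= col_p.sum()
    row_p = (W**2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(W.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(W.shape[0], size=k, replace=False, p=row_p)
    C, R = W[:, cols], W[rows, :]
    # U minimizing ||W - C U R||_F for the chosen C and R: U = C^+ W R^+
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)
    # cols/rows record exactly which input/output features were kept,
    # which is the interpretability advantage over SVD factors.
    return C, U, R, cols, rows

W = np.random.default_rng(1).standard_normal((64, 64))
C, U, R, cols, rows = cur_compress(W, k=32)
err = np.linalg.norm(W - C @ U @ R) / np.linalg.norm(W)
```

Storing C (64×32), U (32×32), and R (32×64) in place of W realizes the compression; inspecting `cols` and `rows` shows which original features each compressed expert retains.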
Our core implementation is complete and validated on WikiText-2 across platforms. We are expanding platform support for Apple Metal and Intel Arc while extending our Transformers.jl and Flux.jl integrations. The poster will include visual comparisons, cross-vendor benchmarks, detailed oral explanations, and QR codes linking to live interactive GitHub examples demonstrating CUR structure preservation.
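The ExpertChoice gating mentioned above inverts the usual routing direction: instead of each token picking its top-k experts, each expert picks its top-c tokens, which makes expert load uniform by construction. A hedged NumPy sketch of that selection step (the function name and shapes are illustrative, not CUR-MoE's gating code):

```python
import numpy as np

def expert_choice_route(scores, capacity):
    """scores: (tokens, experts) affinity logits.
    Each expert selects its `capacity` highest-scoring tokens, so every
    expert processes exactly `capacity` tokens regardless of how the
    scores are distributed -- load balancing is guaranteed, not learned."""
    # Sort tokens per expert by descending score; keep the top `capacity`.
    picks = np.argsort(-scores, axis=0)[:capacity, :]  # (capacity, experts)
    return picks

rng = np.random.default_rng(0)
scores = rng.standard_normal((16, 4))          # 16 tokens, 4 experts
picks = expert_choice_route(scores, capacity=4)
load = np.array([picks[:, e].size for e in range(4)])  # uniform by construction
```

The trade-off is that a token may be chosen by zero experts or by several, so token coverage varies even though expert load does not.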

Event Type
Research and ACM SRC Posters
Time
Tuesday, 18 November 2025, 8:00am - 5:00pm CST
Location
Second Floor Atrium
