The Physical AI Imperative
The “ChatGPT moment” of 2022 ushered in an era where consumers and businesses expect every device to be intelligent, responsive, and conversational. But there’s a fundamental mismatch between this expectation and reality — and the gap shows up everywhere.
Smart glasses need always-on speech recognition for commands, translation, and context-aware assistance—without heating up your face or draining the battery. TWS earbuds require ultra-low-latency automatic speech recognition (ASR) for voice assistants, call transcription, and live captions, even in noisy environments. Home appliances like ovens, washers, thermostats, and robotic vacuums need reliable voice control that works offline, instantly, and privately.
Cloud-based inference can’t meet these constraints. Latency, connectivity, privacy, and power all break down. The only viable path forward is high-performance AI running directly on-device—but that requires models to be orders of magnitude smaller and more efficient. This is where pruning and sparsity become essential.
Shrinking AI: Why Sparsity Matters
To make large AI models small, fast, and efficient enough for deployment, researchers rely on a suite of compression techniques: pruning, quantization, and distillation. Among these, pruning—removing unnecessary parameters—offers some of the biggest gains.
A key insight from our research: Larger models can be pruned far more aggressively than smaller ones, without losing accuracy. Even better, these gains only fully materialize when pruning is paired with a full-stack approach: sparse-aware models, compilers, runtimes, and hardware working together. Without that, sparsity remains a theoretical win rather than a practical one.
How femtoAI True Sparsity Technology Unlocks Physical AI
We’ve built a full-stack platform to bring advanced AI to ultra-low-power edge devices. Inspired by principles from neuromorphic computing—especially sparsity and locality—our approach enables models to run efficiently in memory- and power-constrained environments.
The femtoAI stack includes a sparsity-first model library, a pruning, quantization, and deployment SDK, a sparsity-aware compiler, and a Sparse Processing Unit (SPU) optimized for sparse execution.
We have deployed multi-million-parameter models (up to 10M) into just 1MB of SRAM on the femtoAI SPU-001 chip. New theoretical and empirical results now show that even larger, more capable models can be compressed aggressively while preserving accuracy—unlocking far richer intelligence at the edge. Specifically, we achieve 90%+ sparsity on ~250M parameter open-source Whisper models while preserving accuracy.
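The back-of-the-envelope arithmetic behind these numbers is worth making explicit. The sketch below assumes 8-bit weights and ignores sparse-index overhead (the real storage format adds some); it is an illustration, not the actual femtoAI memory layout.

```python
def sparse_footprint_mb(params: int, sparsity: float, bits_per_weight: int = 8) -> float:
    """Approximate storage needed for the nonzero weights only,
    assuming 8-bit weights and no index overhead (both assumptions)."""
    nonzero = params * (1 - sparsity)
    return nonzero * bits_per_weight / 8 / 1e6  # bytes -> MB

# A 10M-parameter model at 90% sparsity leaves ~1M nonzero weights, i.e. ~1 MB
print(sparse_footprint_mb(10_000_000, 0.90))

# A ~250M-parameter Whisper at 90% sparsity leaves ~25M nonzero weights, ~25 MB
print(sparse_footprint_mb(250_000_000, 0.90))
```

The same arithmetic shows why sparsity and quantization compound: each attacks a different factor in the parameters-times-bits product.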
Applying State-of-the-Art Pruning to Whisper
ASR is one of the most critical workloads for edge AI. From glasses and earbuds to appliances and robots, speech is the primary interface. We focused on compressing OpenAI’s Whisper, a widely used ASR model known for its robustness and accuracy—but also its heavy computational footprint.
The Goal: Aggressively compress OpenAI’s Whisper model while maintaining speech recognition accuracy, measured by Word Error Rate (WER).
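WER counts word-level substitutions, deletions, and insertions against a reference transcript, normalized by the reference length; it is computed as a word-level edit distance. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("turn on the oven", "turn on the oven"))   # 0.0
print(wer("turn on the oven", "turn of the oven"))   # 0.25 (one substitution)
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions, which is why "maintaining accuracy" is usually stated as staying within a small WER delta of the dense baseline.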
The Methodology: We applied Optimal Brain Surgeon (OBS) pruning, a second-order, Hessian-aware method, and compared it against two standard magnitude-based baselines: local pruning (MP-local), which removes the smallest-magnitude weights independently within each layer, and global pruning (MP-global), which ranks and removes weights across the entire network.
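For concreteness, the two magnitude-based baselines can be sketched in a few lines of NumPy; this is an illustrative toy, not the evaluation code used in the study. (OBS goes further: it scores each weight by the second-order saliency w_i^2 / (2 [H^-1]_ii) and updates the surviving weights to compensate for each removal, which is what makes it Hessian-aware.)

```python
import numpy as np

def prune_local(layers, sparsity):
    """MP-local: zero the smallest-magnitude weights independently per layer."""
    pruned = []
    for w in layers:
        k = int(round(sparsity * w.size))
        mask = np.ones(w.size, dtype=bool)
        mask[np.argsort(np.abs(w).ravel())[:k]] = False  # drop k smallest |w|
        pruned.append((w.ravel() * mask).reshape(w.shape))
    return pruned

def prune_global(layers, sparsity):
    """MP-global: rank |w| across all layers, zero the smallest overall."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(round(sparsity * all_mags.size))
    thresh = np.sort(all_mags)[k - 1] if k > 0 else -np.inf
    return [np.where(np.abs(w) <= thresh, 0.0, w) for w in layers]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(8, 8)), rng.normal(size=(16, 4))]
local = prune_local(layers, 0.5)   # each layer exactly 50% sparse
glob = prune_global(layers, 0.5)   # 50% sparse overall; per-layer rates vary
```

The difference matters in practice: global ranking lets over-parameterized layers absorb most of the pruning, while local pruning forces a uniform rate even on sensitive layers.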
The study included 400+ pruning sweeps across pruning schedules and fine-tuning strategies, as well as scaling studies across Whisper model sizes.
Key Results
1. Larger models can be pruned more aggressively with OBS
We scaled OBS pruning across different Whisper model sizes. Larger models tolerated substantially more aggressive pruning than smaller ones, reaching up to 60% sparsity with little to no degradation in WER.
2. OBS is a superior post-training pruning strategy
OBS significantly outperforms magnitude-based methods, extending the achievable sparsity by roughly 15–20 percentage points beyond both MP-local and MP-global at comparable accuracy.
3. Fine-tuning + iterative pruning achieves ~90% sparsity
We conducted approximately 400 fine-tuning and iterative OBS pruning experiments on Whisper-tiny, systematically exploring hyperparameter combinations to identify optimal settings. The resulting Pareto front demonstrates that we can achieve up to ~90% sparsity with minimal performance degradation—a significant improvement over the ~45% sparsity limit achievable with baseline OBS.
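The iterative recipe follows a common pattern: ramp sparsity up gradually according to a schedule, fine-tuning between pruning steps so the network can recover. The toy below uses a cubic sparsity ramp (a common choice in the pruning literature; the schedules actually swept in the study are not specified here), magnitude pruning in place of OBS, and a no-op fine-tune placeholder.

```python
import numpy as np

def sparsity_schedule(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp from initial to final sparsity (an assumed schedule)."""
    t = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - t) ** 3

def prune_to(w, sparsity):
    """Magnitude-prune one weight array to the target sparsity
    (previously zeroed weights stay zero, so sparsity only grows)."""
    k = int(round(sparsity * w.size))
    mask = np.ones(w.size, dtype=bool)
    mask[np.argsort(np.abs(w).ravel())[:k]] = False
    return (w.ravel() * mask).reshape(w.shape)

def fine_tune(w):
    """Placeholder for a real fine-tuning pass (gradient steps on data)."""
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))
for step in range(1, 11):                       # 10 prune/fine-tune rounds
    target = sparsity_schedule(step, 10, final_sparsity=0.9)
    w = prune_to(w, target)
    w = fine_tune(w)
print(np.mean(w == 0))                          # ends near 0.9 sparsity
```

The hyperparameter sweeps described above would then vary the schedule shape, number of rounds, final sparsity, and the fine-tuning budget between rounds.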
4. Best configuration outperforms all individual approaches
Two configurations dominated the Pareto front. Combining them into a single scheduled pruning strategy produced a configuration that outperforms either one alone. We validated this across Whisper model sizes (tiny, base, and small).
Conclusion
By combining state-of-the-art OBS pruning, fine-tuning, and full-stack sparse acceleration, we demonstrate that large, high-quality ASR models can be compressed by up to 10× while maintaining strong accuracy—making it practical to deploy cloud-grade speech intelligence directly on-device.
The implications are concrete: always-on speech recognition for smart glasses without overheating or battery drain; ultra-low-latency ASR for earbuds, entirely offline; instant, private voice control for home appliances; and continuous speech understanding for robotics and wearables under extreme power and memory constraints.
More AI per MB is no longer a distant promise—it’s becoming the foundation of physical AI.