A new approach makes AI models leaner and faster as they train | MIT News

Training a large artificial intelligence model is expensive, not just in dollars, but in time, energy, and computing resources. Traditionally, getting a smaller, faster model means either training a larger one first and compressing it afterward, or training a small one from scratch and accepting weaker performance.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems, ETH Zurich, and Liquid AI have now developed a method that sidesteps this trade-off entirely, compressing models during training rather than after.

The process, called CompreSSM, targets a family of AI architectures known as state-space models, which power applications ranging from language processing to sound generation and robotics. By borrowing tools from control theory, the researchers can identify which parts of the model are pulling their weight and which are dead weight, and remove the unnecessary parts early in the training process.

“It’s basically a way to make the models smaller and faster as they train,” said Makram Chahine, a PhD student in electrical engineering and computer science, a CSAIL affiliate, and lead author of the paper. “As they learn, they also shed the parts that aren’t contributing.”

An important insight is that the relative importance of different components within these models is remarkably stable during training. Using quantities from control theory called Hankel singular values, which measure how much each internal state contributes to the overall behavior of the model, the team showed that they could reliably identify which dimensions were important and which were not after only about 10 percent of the training process. Once those rankings are established, the less important parts can be safely discarded, and the remaining 90 percent of training proceeds at the smaller model’s faster speed.
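The article doesn’t spell out CompreSSM’s exact procedure, but the control-theoretic quantity it relies on can be illustrated. For a stable linear time-invariant system, Hankel singular values are computed from the controllability and observability Gramians, and states with small values contribute little to input-output behavior. Here is a minimal sketch using NumPy and SciPy (the random system and all variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
n, m, p = 8, 2, 2  # state, input, and output dimensions

# Build a random stable LTI system (x' = Ax + Bu, y = Cx)
# by shifting A's eigenvalues into the left half-plane.
A = rng.standard_normal((n, n))
A = A - (np.max(np.linalg.eigvals(A).real) + 1.0) * np.eye(n)
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))

# Controllability Gramian P:  A P + P A^T + B B^T = 0
P = solve_continuous_lyapunov(A, -B @ B.T)
# Observability Gramian Q:    A^T Q + Q A + C^T C = 0
Q = solve_continuous_lyapunov(A.T, -C.T @ C)

# Hankel singular values are the square roots of the
# eigenvalues of P Q, sorted from most to least important.
hsv = np.sort(np.sqrt(np.abs(np.linalg.eigvals(P @ Q))))[::-1]
print(hsv)  # large entries = states that matter; small ones = pruning candidates
```

In this spirit, one would rank the state dimensions by `hsv` and truncate those below some tolerance; the article’s claim is that, for state-space networks, this ranking stabilizes early enough in training to prune safely.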

“What’s interesting about this work is that it turns compression into a part of the learning process itself,” said senior author Daniela Rus, an MIT professor and director of CSAIL. “Instead of training a large model and then figuring out how to make it smaller, CompreSSM allows the model to find its own efficient structure as it learns. That’s a very different way of thinking about building AI systems.”

The results are striking. In image classification benchmarks, the compressed models maintain nearly the same accuracy as their full-size counterparts while training up to 1.5 times faster. A compressed model reduced to about a quarter of its initial size achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared with only 81.8 percent for a model trained at that smaller size from scratch. For Mamba, one of the most widely used state-space architectures, the method achieved nearly a 4x training speedup, compressing a 128-dimensional model down to about 12 dimensions while maintaining competitive performance.

“You get strong model performance, because you capture most of the dynamics during the warm-up phase and keep only the most useful dimensions,” Chahine said. “The model can still perform at a higher level than a small model trained from scratch.”

What makes CompreSSM different from existing methods is its timing. Conventional pruning techniques train the full model and remove parts after the fact, which means you still pay the full computational cost of training a large model. Knowledge distillation, another popular approach, requires training a large “teacher” model to completion and then training a second, smaller “student” model to imitate it, essentially doubling the training effort. CompreSSM avoids both of these costs by making informed compression decisions during training itself.

The team tested CompreSSM head-to-head against both alternatives. Compared with Hankel nuclear norm regularization, a recently proposed spectral technique for encouraging compact state-space models, CompreSSM was more than 40 times faster while also achieving higher accuracy. The regularization method slowed training by about 16 times because it required expensive eigenvalue computations at every gradient step, and even then the resulting models were less accurate. Against knowledge distillation on CIFAR-10, CompreSSM had a clear advantage for highly compressed models: at small state sizes, the distilled models saw a significant drop in accuracy, while the CompreSSM-compressed models stayed close to full performance. And because distillation must run both the teacher and the student at every training step, even its small student models train more slowly than the full-size baseline.

The researchers proved mathematically, using Weyl’s inequality, that the importance of each model state changes smoothly during training, and they showed empirically that the relative rankings of those states remain stable. Together, these findings give practitioners confidence that dimensions identified as unimportant early on will not suddenly become important later.

The method also comes with a pragmatic safety net. If the compression step causes an unexpected drop in performance, practitioners can roll back to a previously saved checkpoint. “It gives people control over how much performance they’re willing to trade for efficiency, rather than imposing an arbitrary size limit,” Chahine explains.

The technique does have practical limitations. CompreSSM works best for models that show a strong link between internal state dimensions and overall performance, a property that varies across tasks and architectures. The method is most effective in multi-input, multi-output (MIMO) models, where the relationship between state size and accuracy is strongest. With single-input, single-output (SISO) designs, the benefits are much smaller, as those models are less sensitive to reductions in state size.

The theory works most cleanly for linear time-invariant systems, although the team has developed extensions for the increasingly popular input-dependent, time-varying architectures. And because the state-space model family extends to structures such as linear attention, a growing area of interest as an alternative to traditional transformers, the range of possible applications is large.

Chahine and his collaborators see the work as a stepping stone. The team has already demonstrated extensions to time-varying systems such as Mamba, and future directions include extending CompreSSM to the matrix-valued time-varying systems used in linear attention methods, which would bring the approach closer to the transformer architectures that underpin most of today’s large AI systems.

“It made sense to start with the simplest setting first, because that’s where the theory is clean and the method stays well grounded,” Chahine said. “It’s a stepping stone to the architectures that people use in industry today.”

“The work of Chahine and his colleagues provides an interesting, theoretically grounded view on the compression of modern state-space models (SSMs),” said Antonio Orvieto, principal investigator at the ELLIS Institute Tübingen and independent group leader at the Max Planck Institute for Intelligent Systems, who was not involved in the research. “This method provides evidence that the state dimension of these models can be effectively reduced during training and that a control-theoretic perspective can effectively guide this process. The work opens up new avenues for future research, and the proposed algorithm has the potential to become a common method when pre-training large SSM-based models.”

The work, which was accepted as a conference paper at the International Conference on Learning Representations (ICLR) 2026, will be presented later this month. It is supported, in part, by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the US Office of Naval Research.
