On Feature Learning in Structured State Space Models
Publication
NeurIPS 2024
Naively scaling standard neural network architectures and optimization algorithms to large models loses desirable properties such as feature learning (see the Tensor Programs series by Greg Yang et al.). We show that for the popular Mamba architecture, traditional scaling rules such as the maximal update parameterization and the related spectral scaling condition fail to induce the correct scaling behavior, due to Mamba's structured HiPPO matrix and its selection mechanism. We derive the correct scaling using random matrix theory, which necessarily goes beyond the Tensor Programs framework.
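
For context, the spectral scaling condition referenced above (Yang et al.) can be summarized as follows; the notation ($W_\ell$, $\Delta W_\ell$, $n_{\text{in}}$, $n_{\text{out}}$) is introduced here only for illustration and is not taken from the paper:

$$\|W_\ell\|_2 = \Theta\!\left(\sqrt{\tfrac{n_{\text{out}}}{n_{\text{in}}}}\right), \qquad \|\Delta W_\ell\|_2 = \Theta\!\left(\sqrt{\tfrac{n_{\text{out}}}{n_{\text{in}}}}\right),$$

i.e., each weight matrix $W_\ell \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}}$ and its training update $\Delta W_\ell$ should have spectral norm on the order of $\sqrt{n_{\text{out}}/n_{\text{in}}}$ as the width grows. The claim of the paper is that imposing this condition alone does not yield the correct feature-learning behavior for Mamba, because of its structured HiPPO matrix and selection mechanism.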