On Feature Learning in Structured State Space Models

Oct 31, 2024 · Leena Chennuru Vankadara*, Jin Xu*, Moritz Haas, Volkan Cevher
Publication: NeurIPS 2024

Naively scaling standard neural network architectures and optimization algorithms to large model sizes loses desirable properties such as feature learning (see the Tensor Programs series by Greg Yang et al.). We show that for the popular Mamba architecture, traditional scaling rules such as the maximal update parameterization (μP) and its related spectral scaling condition fail to induce the correct scaling behavior, due to Mamba's structured HiPPO matrix and its selection mechanism. Using random matrix theory, we derive the correct scaling rules; this analysis necessarily goes beyond the Tensor Programs framework.
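For context, here is a sketch of the spectral scaling condition referenced above, in the form stated by Yang, Simon, and Bernstein in "A spectral condition for feature learning"; the layer index $\ell$ and width symbols $n_\ell$ are notation chosen for this sketch, not taken from our paper.

```latex
% A sketch of the spectral scaling condition (Yang, Simon & Bernstein, 2023).
% W_l denotes the weight matrix of layer l, mapping width n_{l-1} to width n_l,
% and \Delta W_l denotes its update after an optimization step. As the widths
% grow, both should scale like sqrt(fan-out / fan-in) in spectral norm:
\[
  \lVert W_\ell \rVert_{*} = \Theta\!\Bigl(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}}\Bigr),
  \qquad
  \lVert \Delta W_\ell \rVert_{*} = \Theta\!\Bigl(\sqrt{\tfrac{n_\ell}{n_{\ell-1}}}\Bigr),
\]
% where \lVert \cdot \rVert_{*} is the spectral norm (largest singular value).
```

For standard dense layers, μP satisfies this condition; the point of the paper is that Mamba's structured HiPPO state matrix and its selection mechanism fall outside the assumptions under which these rules guarantee feature learning at scale.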