Matrix Stability - Manifold-Constrained Hyper-Connections & Muon Optimizer
1. The Dichotomy of Matrices
In modern architectures like Transformers, stability is not a monolithic property: it depends on the weight matrices that signals pass through. The properties of these matrices (their sizes, entries, eigenvalues, and singular values) crucially affect how information and gradients flow. For example, if a weight matrix has very large singular values, it can amplify signals and gradients; if it has very small ones, it can attenuate them. Orthogonal matrices (square matrices whose rows and columns are orthonormal vectors) have all singular values equal to 1, so they preserve vector norms and avoid distortion, serving as a gold standard for stable signal propagation. In general, keeping matrix products well-conditioned (e.g. with moderate spectral norms) is key to avoiding exploding or vanishing signals in deep nets. ...
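The effect of singular values on signal propagation can be illustrated with a toy experiment. The sketch below is an assumption-laden illustration, not a model of a real Transformer: it uses a hypothetical 2x2 "layer" with no nonlinearity, applies the same matrix 50 times, and compares an orthogonal matrix (a rotation) with the same rotation scaled by 1.1 and by 0.9. All function names and constants here are illustrative choices.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def propagate(W, x, depth):
    """Apply the same weight matrix `depth` times (toy linear deep net)."""
    for _ in range(depth):
        x = matvec(W, x)
    return x

def rotation(theta):
    """2x2 rotation matrix: orthogonal, so both singular values equal 1."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def scaled(W, gain):
    """Scale every entry by `gain`; for a rotation this sets both singular values to `gain`."""
    return [[gain * w for w in row] for row in W]

depth = 50
x0 = [3.0, 4.0]          # input with Euclidean norm 5
Q = rotation(0.7)        # arbitrary angle, chosen only for illustration

norm_orthogonal = norm(propagate(Q, x0, depth))              # singular values = 1
norm_exploding = norm(propagate(scaled(Q, 1.1), x0, depth))  # singular values = 1.1
norm_vanishing = norm(propagate(scaled(Q, 0.9), x0, depth))  # singular values = 0.9

print(f"orthogonal: {norm_orthogonal:.6g}")
print(f"gain 1.1:   {norm_exploding:.6g}")
print(f"gain 0.9:   {norm_vanishing:.6g}")
```

In this setup the orthogonal matrix leaves the norm at 5 (up to floating-point error), while a gain of 1.1 multiplies it by 1.1^50 (roughly 117) and a gain of 0.9 shrinks it by 0.9^50 (roughly 0.005). Even a 10% deviation from unit singular values compounds quickly with depth, which is why well-conditioned products matter.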