1. The Dichotomy of Matrices
In modern architectures like Transformers, stability is not a monolithic property. The properties of the matrices involved – their sizes, entries, and eigenvalues – crucially affect how information (and gradients) flows. For example, if a weight matrix has very large or very small singular values, it can amplify or attenuate signals and gradients. Orthonormal matrices (whose rows and columns are orthogonal unit vectors) preserve vector norms and avoid distortion, serving as a gold standard for stable signal propagation. In general, keeping matrix products well-conditioned (e.g. with moderate spectral norms) is key to avoiding exploding or vanishing signals in deep nets.
Stability arises from the interaction of three functionally distinct types of matrices. Understanding the “physics” of each role is essential for diagnosing training instability.
- The State Matrix (Data, $X$): The “Actor” – contains the evolving signal.
- The Weight Matrix (Parameters, $W$): The “Sculptor” – defines semantic transformations.
- The Operator Matrix (Topology, $\mathcal{H}$ or $A$): The “Conductor” – governs the mixing and flow of states.
This distinction dictates how matrices interact with data, how they jeopardize stability, and how recent advancements like Manifold-Constrained Hyper-Connections (mHC) and the Muon optimizer address these specific challenges.
2. Background - Matrix Operations & Physical Interpretations
When a neural network layer computes $y = Wx$, the matrix $W$ transforms the input vector $x$ into the output $y$. Chaining layers means repeated products of matrices: if the layers have weight matrices $W_1, W_2, \dots, W_L$, an input $x_0$ produces $x_L = W_L W_{L-1} \cdots W_1 x_0$. A classic stability issue is that the singular values of the product tend to multiply: if each $W_i$ has singular values slightly larger than 1, repeated multiplication amplifies signals exponentially; if slightly smaller than 1, signals vanish. During training, gradients also backpropagate through the transposed product, so ill-conditioned weights lead to vanishing or exploding gradients. Careful weight initialization (e.g. Xavier/He) and normalization layers help, but core stability often comes from network structure.
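As a sanity check on this multiplicative behavior, here is a minimal NumPy sketch (the width, depth, and scales are arbitrary choices for illustration): a unit vector is pushed through a stack of random layers scaled so that each layer rescales the expected norm by a factor slightly above or below 1, and the signal grows or decays exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 50

def product_norm(scale):
    """Push a unit-norm vector through `depth` random layers scaled so that
    each multiplication rescales the expected norm by `scale`; return the final norm."""
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(depth):
        W = scale * rng.standard_normal((d, d)) / np.sqrt(d)
        x = W @ x
    return np.linalg.norm(x)

print("scale 1.1 ->", product_norm(1.1))  # norm amplifies exponentially
print("scale 0.9 ->", product_norm(0.9))  # norm decays toward zero
```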
2.1 The Row-Column Duality
Consider a matrix multiplication $C = A \times B$. The operation can be interpreted through two distinct physical lenses, depending on whether we view the operation row-wise or column-wise.
Row-Space Interpretation (Left-Action)
Each row $C_{i,:}$ is a linear combination of the rows of $B$, weighted by the $i$-th row of $A$.
$$C_{i,:} = \sum_{k} A_{i,k} B_{k,:}$$

In deep learning, if $B$ represents a set of entities (e.g., tokens in a sequence or residual streams), $A$ acts as a mixing operator. It redistributes information across entities without altering their internal feature representation.
Column-Space Interpretation (Right-Action)
Each column $C_{:,j}$ is a linear combination of the columns of $A$, weighted by the $j$-th column of $B$.
$$C_{:,j} = \sum_{k} B_{k,j} A_{:,k}$$

If $A$ represents data and $B$ represents parameters, $B$ acts as a feature transformer. It maps the input features into a new semantic space (e.g., dimension expansion in an FFN) without mixing the independent entities.
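Both readings describe the same product, of course; the following NumPy sketch (with arbitrary shapes) verifies the two interpretations of $C = A \times B$ numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # e.g. a mixing operator over 4 entities
B = rng.standard_normal((4, 5))   # e.g. 4 entities with 5 features each
C = A @ B

# Row-space view: row i of C is a combination of the rows of B, weighted by row i of A.
C_rows = np.stack([sum(A[i, k] * B[k, :] for k in range(4)) for i in range(3)])

# Column-space view: column j of C is a combination of the columns of A, weighted by column j of B.
C_cols = np.stack([sum(B[k, j] * A[:, k] for k in range(4)) for j in range(5)], axis=1)

assert np.allclose(C, C_rows) and np.allclose(C, C_cols)
```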
Right-Multiplication ($XW$): Semantic Transformation
When the State matrix $X$ is multiplied by a Weight matrix $W$ on the right:
Role of $W$: Acts as a feature transformer.
Physicality: It redefines the internal “meaning” of each entity (row) independently. It maps features from one coordinate system to another.
Stability Goal: Isometry. We want $W$ to preserve the “energy” (variance/norm) of the features so they neither vanish nor explode as they move through the high-dimensional space.
Left-Multiplication ($\mathcal{H}X$): Spatial/Stream Mixing
When an Operator matrix $\mathcal{H}$ is multiplied by the State matrix $X$ on the left:
Role of $\mathcal{H}$: Acts as a mixing operator or topology.
Physicality: It determines how information is shared between different entities (rows). It creates the “wiring” of the network.
Stability Goal: Conservation. We want $\mathcal{H}$ to be a “fair distributor” of energy across different paths (residual streams) or entities (tokens).
2.2 Stability Implications
Forward Stability: Controlled by the rows of the left matrix. If the sum of weights in a row exceeds 1 (or the spectral radius $> 1$), signal magnitude amplifies layer by layer.
Backward Stability: Controlled by the columns of the left matrix (which become rows of the transposed matrix during backpropagation). If column norms are uncontrolled, gradients explode or vanish.
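This asymmetry is easy to see numerically. In the toy NumPy sketch below (with an arbitrary non-negative mixing matrix), the forward output of a flat signal reproduces the row sums, while a flat gradient pushed through the transpose reproduces the column sums.

```python
import numpy as np

rng = np.random.default_rng(2)
H = np.abs(rng.standard_normal((4, 4)))   # an unconstrained non-negative mixing operator
x = np.ones(4)                            # a "flat" signal across 4 streams

# Forward pass: output magnitude is governed by the row sums of H.
print("row sums      :", H.sum(axis=1))
print("forward  Hx   :", H @ x)           # entries equal the row sums

# Backward pass: gradients flow through H.T, so the column sums of H take over.
g = np.ones(4)                            # a "flat" upstream gradient
print("column sums   :", H.sum(axis=0))
print("backward H^T g:", H.T @ g)         # entries equal the column sums
```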
2.3 Cross-Architecture Comparison: The Versatility of Matrix Roles
These archetypes persist across all classic deep learning models. The fundamental difference between architectures often lies in which role is prioritized and how the “Operator” is defined.
MLP: Pure Semantic Transformation
In a Multi-Layer Perceptron (MLP), the system is almost entirely composed of Weights.
Operation: $Y = \sigma(XW)$.
Matrix Role: $X$ is the State, and $W$ is the Weight. There is no explicit Operator mixing information across samples.
Nature: Stability is managed through parameter distribution (initialization) because there is no topological complexity to regulate.
GNN: The Explicit Operator
Graph Neural Networks (GNNs) are the most direct precursors to the mHC philosophy.
Operation: $H^{(l+1)} = \sigma(\hat{A} H^{(l)} W)$.
Matrix Role: $\hat{A}$ (the Adjacency matrix) is a pure Operator. It defines the graph topology.
Nature: GNN stability relies heavily on $\hat{A}$ being “Row-Normalized” (a form of stochasticity). Without this, node features explode as they aggregate information from neighbors, a phenomenon identical to the “Identity Collapse” mHC seeks to prevent.
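As an illustration (a hand-picked toy adjacency matrix, weights omitted), repeated aggregation with the raw adjacency inflates feature norms roughly like degree^20, while the row-normalized operator keeps them bounded.

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny undirected graph with self-loops, i.e. A + I before normalization.
A = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)
A_norm = A / A.sum(axis=1, keepdims=True)   # row (degree) normalization: rows sum to 1

H0 = rng.standard_normal((4, 8))            # node features
H_raw, H_norm = H0.copy(), H0.copy()
for _ in range(20):                         # 20 rounds of neighbor aggregation
    H_raw = A @ H_raw                       # unnormalized: norms blow up
    H_norm = A_norm @ H_norm                # row-stochastic: norms stay bounded

print("raw adjacency  :", np.linalg.norm(H_raw))
print("row-normalized :", np.linalg.norm(H_norm))
```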
CNN: Implicit Spatial Weights
Convolutional Neural Networks (CNNs) fuse the Operator and Weight roles.
Operation: $Y = X * K$ (Convolution).
Matrix Role: The kernel $K$ acts as both a Weight (extracting features) and a local Operator (mixing nearby pixels).
Nature: Stability is typically maintained through the “Weight” lens (Batch Norm or weight decay), as the “Operator” part of a convolution is spatially constrained by the kernel size, preventing global signal explosion.
RNN: The Temporal Operator
Recurrent Neural Networks (RNNs) introduce a “Hidden-to-Hidden” matrix that acts as an Operator across time.
Operation: $h_t = \sigma(W_h h_{t-1} + W_x x_t)$.
Matrix Role: $W_h$ is a Temporal Operator.
Nature: Because this Operator is applied repeatedly (recursively), it is extremely sensitive. Stability here requires the Spectral Radius of $W_h$ to be close to 1, similar to the Muon or mHC constraints, to prevent gradients from vanishing or exploding over long sequences.
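This sensitivity can be reproduced with a toy linear recurrence (the nonlinearity is dropped for clarity; the width, horizon, and radii are arbitrary): rescaling $W_h$ to a spectral radius below, at, or above 1 determines whether the hidden state vanishes, stays bounded, or explodes.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 32, 200

def unroll(target_radius):
    """Unroll a linear recurrence h_t = W_h h_{t-1} with W_h rescaled to the
    given spectral radius, and return the final hidden-state norm."""
    W = rng.standard_normal((d, d))
    W *= target_radius / np.max(np.abs(np.linalg.eigvals(W)))
    h = rng.standard_normal(d)
    for _ in range(T):
        h = W @ h
    return np.linalg.norm(h)

for rho in (0.9, 1.0, 1.1):
    print("spectral radius", rho, "-> final ||h|| =", unroll(rho))
```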
Summary of Matrix Roles and Stability Keys
The following table synthesizes how different architectures deploy matrices and the specific mathematical “key” required to unlock training stability for each role.
| Architecture | Primary Operation | State ($X$) | Operator (Left-Mult) | Weight (Right-Mult) | Key for Stability |
|---|---|---|---|---|---|
| MLP | $XW$ | Sample Data | N/A | Feature Map | Variance Preservation (Kaiming / Xavier) |
| CNN | $X * K$ | Image Patch | Local Topology | Filter Kernel | Weight Regularization / Batch Norm |
| RNN | $W_h h_{t-1}$ | Time Step | Temporal Link | N/A | Spectral Radius $\approx 1$ (Unit Circle) |
| GNN | $AXW$ | Node Feature | Graph Adjacency | Feature Map | Row / Degree Normalization |
| Transformer | $AV$ and $XW$ | Token Embedding | Attention Map | Linear Projection | Softmax (Row-Stochasticity) |
| mHC | $\mathcal{H}X$ | Residual Stream | Hyper-Connection | N/A | Double Stochasticity (Sinkhorn) |
| Any (Muon) | $W \leftarrow W + \Delta$ | N/A | N/A | Update Matrix | Orthogonality (Newton–Schulz) |
3. The Operator View: Signal Conservation and Residual Connections
Residual (skip) connections were invented to stabilize very deep nets by preserving an identity mapping: each layer computes $x + F(x)$, simply adding the input $x$ to the transformed signal $F(x)$ rather than replacing it. This lets gradients flow almost unchanged through layers. Residual connections help mitigate vanishing gradients, enabling the effective training of very deep networks.
However, even residual designs have trade-offs. The two main styles – Pre-Norm (normalize then transform) and Post-Norm (transform then normalize) – each solve one problem at the expense of another. For example:
Pre-Norm: With normalization before each block, gradients stay strong (no vanishing), but very deep layers can suffer a “representation collapse” where outputs become too similar across depth.
Post-Norm: Normalizing after each block avoids collapse, but the identity shortcut is effectively weakened and vanishing gradients can reappear.
In short, Pre-Norm vs Post-Norm is a seesaw: one end fixes vanishing gradients but risks collapse, the other fixes collapse but reopens vanishing. Both are fixed-strength (non-learnable) shortcuts, so the network cannot adjust how much of the input to carry forward. In practice, these trade-offs limit model capacity: a single narrow residual path can become a bottleneck as models scale up.
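For concreteness, here is a minimal NumPy sketch of the two arrangements, with a hand-rolled layer norm and a stand-in sublayer (both assumptions for the example); the only point is where the normalization sits relative to the identity path.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def pre_norm_block(x, f):
    # Normalize, transform, then add: the identity path x is left untouched,
    # so gradients flow through the skip unchanged.
    return x + f(layer_norm(x))

def post_norm_block(x, f):
    # Add first, then normalize: the normalization sits on the identity path,
    # rescaling (and effectively weakening) the shortcut.
    return layer_norm(x + f(x))

rng = np.random.default_rng(5)
W = rng.standard_normal((64, 64)) / np.sqrt(64)
f = lambda x: np.maximum(x @ W, 0.0)    # a stand-in sublayer

x = rng.standard_normal(64)
print("pre-norm :", np.linalg.norm(pre_norm_block(x, f)))
print("post-norm:", np.linalg.norm(post_norm_block(x, f)))
```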
3.1 Hyper-Connections: Learning the Skip Pattern
To overcome the fixed-strength skip bottleneck, Hyper-Connections (HC)1 introduce learnable matrices $\mathcal{H}^{res}$ to mix multiple residual streams, expanding the width of the signal path. Instead of one fixed identity skip, HC creates multiple parallel “residual” streams with learnable mixing. Technically, each layer’s input is expanded into $n$ copies ($n$ is called the expansion rate), and a trainable matrix governs how much of each copy flows through to the output.
$$x_{l+1} = \mathcal{H}_l^{res} x_l + \dots$$

This adds depth-connections (weights on inputs vs. outputs) and width-connections (mixing among hidden copies), all encoded in a single learnable matrix $\mathcal{H}$. The promise is high: with only a small amount of extra computation, the network can adjust residual strengths and even rearrange layer contributions dynamically. As Zhu et al. note, HC “lead to significantly improved performance with a negligible increase in computation and parameters”. In other words, Hyper-Connections aim to give a wider, richer skip pathway without much cost.
On paper, HC seems ideal – for example, it offers “higher information bandwidth,” “minimal FLOP increase,” and “better feature mixing across layers.” In experiments on large language models and vision tasks, HC (even in a simple static form) indeed sped up convergence and improved results over a vanilla ResNet/Transformer baseline. HC can also be made dynamic (DHC), so that connection weights depend on the input, further boosting flexibility. In summary, Hyper-Connections allow a model to learn how strongly to pass signals from previous layers, rather than hardcoding the identity.
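The sketch below is an illustrative parameterization, not the paper’s exact formulation: the residual stream is replicated into $n$ copies, a learnable $n \times n$ matrix mixes the copies, and the layer output is added back onto every stream. The names, initialization, and stand-in sublayer are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 4, 64                                  # expansion rate and feature width

# n parallel residual streams, each a d-dimensional row vector.
streams = np.tile(rng.standard_normal(d), (n, 1))

H_res = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # learnable stream-mixing matrix (init near identity)
W = rng.standard_normal((d, d)) / np.sqrt(d)           # the layer's own weight matrix

def hc_layer(streams, H_res, W):
    """One illustrative hyper-connection step: left-multiply by H_res to mix
    the n streams, run a stand-in sublayer on an aggregated view, add it back."""
    mixed = H_res @ streams                   # width/depth connections in a single matrix
    layer_out = np.maximum(mixed.mean(axis=0) @ W, 0.0)
    return mixed + layer_out                  # broadcast the layer output onto every stream

out = hc_layer(streams, H_res, W)
print(out.shape)                              # (n, d)
```

Note that nothing here ties `H_res` to the identity; without a constraint, repeated applications of such mixing matrices can drift arbitrarily, which is exactly the failure mode discussed next.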
When a matrix acts as an Operator (Left-multiplication), its primary function is topological: determining how information flows between nodes (e.g., tokens or streams). The recent proposal of Manifold-Constrained Hyper-Connections (mHC) provides a rigorous framework for stabilizing these operators.
3.2 The Problem: Unconstrained Mixing
While HC enhances expressivity, unconstrained matrix multiplication destroys the Identity Mapping property essential for deep residual learning. Replacing the identity skip with a free matrix breaks the norm-preserving guarantee of residuals. In deep models, “small amplification errors compound”.
DeepSeek’s analysis2 found that as depth increases, unconstrained HC can cause the signal norm to explode or vanish, making training “numerically unstable.” For instance, in a 27-billion-parameter model, an unconstrained HC caused signal magnitudes to increase over $3000\times$, leading to catastrophic divergence. This is a structural issue: with standard (identity) residuals the skip is an exact 1× mapping (norm-preserving), but with HC the skip weights can drift away from 1. Over many layers this small drift compounds, and gradients either blow up or die out.
In short, while HC provides extra capacity, it loses the built-in stability of an identity shortcut. The composite mapping $\prod \mathcal{H}^{res}$ across layers fails to preserve the global mean of features, leading to unbounded signal amplification (exploding signals). Empirical analysis shows gain magnitudes reaching $3000\times$ in unconstrained networks.
Imagine repeatedly multiplying a vector by general matrices: without constraints, the product can grow or shrink exponentially. HC exploits matrices that are not anchored to identity, so deep signal propagation becomes unpredictable.
In practice, unconstrained HC training shows severe loss spikes and gradient blow-up – “training instability” beyond what mere tuning can fix.
3.3 The Solution: Manifold Projection
To restore stability, DeepSeek’s mHC (Manifold-Constrained Hyper-Connections)2 constrains the HC matrices to a manifold that preserves the key properties of the identity mapping. Specifically, mHC forces each residual-mixing matrix $\mathcal{H}_l^{res}$ to be doubly stochastic (to lie in the Birkhoff polytope of matrices whose rows and columns each sum to 1).
Concretely: each learnable skip matrix $\mathcal{H}$ is projected via the Sinkhorn–Knopp algorithm so that every row and column sums to 1. Such matrices represent convex combinations of the inputs. By doing so, mHC recovers a norm-preserving property: every output is a weighted average of inputs, so on average the feature magnitude is conserved.
Mathematically, projecting $\mathcal{H}$ onto the Birkhoff polytope means that each entry of $\mathcal{H}x$ is a convex combination of the components of $x$.
Row-Stochastic Constraint (Row sum = 1): Ensures that the output signal is a convex combination of the input features. Physically, this preserves the mean signal intensity during the forward pass, preventing activation explosion.
Column-Stochastic Constraint (Column sum = 1): Since backpropagation involves the transpose $(\mathcal{H}_l^{res})^\top$, the column sum of the forward matrix becomes the row sum of the backward matrix. Enforcing column stochasticity ensures that gradients are also convex combinations, preventing gradient explosion.
Implementation: This is achieved via the Sinkhorn-Knopp algorithm, which iteratively normalizes rows and columns to converge on a doubly stochastic matrix.
By constraining the operator to this specific manifold, mHC guarantees that the spectral norm is bounded ($||\mathcal{H}||_2 \le 1$), creating a non-expansive mapping that is theoretically stable at any depth.
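A minimal sketch of the Sinkhorn-Knopp projection described above; the exponential parameterization and the fixed iteration count are assumptions made for the example, and mHC’s exact recipe may differ.

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Project a square matrix of unconstrained scores onto (approximately)
    the Birkhoff polytope by alternately normalizing rows and columns."""
    M = np.exp(logits)                        # positive entries
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)     # make rows sum to 1
        M /= M.sum(axis=0, keepdims=True)     # make columns sum to 1
    return M

rng = np.random.default_rng(7)
H = sinkhorn_knopp(rng.standard_normal((4, 4)))
print("row sums     :", H.sum(axis=1))        # ~1
print("column sums  :", H.sum(axis=0))        # ~1
print("spectral norm:", np.linalg.norm(H, 2)) # <= 1 for a doubly stochastic matrix
```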
Empirical results confirm that mHC combines stability with improved performance. On benchmark tasks (MMLU, GSM8K, etc.), models with mHC consistently outperformed both standard Transformers and HC-variants, demonstrating that the gains are structural, not cosmetic. Impressively, these stability constraints add very little overhead: with a $4\times$ wider residual stream, total training time increased by only about 6.7%, thanks to kernel fusion and efficient scheduling (TileLang, DualPipe) described in the mHC work.
Beyond Transformers, the mHC principle has broader significance. Any deep architecture that relies on skip connections or highway paths could potentially suffer from unconstrained mixing. The insight of constraining weight matrices via manifolds (here, doubly stochastic) draws on linear algebra fundamentals. In Convolutional Nets or RNNs, similar issues of vanishing/exploding activations arise from repeated linear ops; orthonormal or normalized weight constraints (e.g. spectral normalization) serve an analogous role. mHC shows how manifold theory meets architecture design: by embedding linear algebraic constraints directly into network topology, one can scale models without sacrificing stability.
4. The Weight View: Isometry and Stabilizing Matrix Optimization
When a matrix acts as a Weight (Right-multiplication, typically $W$ in $XW$), its function is semantic transformation. Here, stability does not come from “conservation of mass” (stochasticity) but from “conservation of geometry” (isometry).
4.1 The Problem: Ill-Conditioning - Eigenspace Imbalance
In high-dimensional transformers, weight updates often suffer from poor conditioning. The gradient landscape is dominated by a few “loud” directions (large singular values), causing standard optimizers (like SGD or Adam) to overshoot in these directions while under-learning in others.
Standard optimizers like Adam treat every parameter as an individual scalar. However, Weights $W$ are 2D structures. During training, certain “directions” in the weight matrix receive massive gradient updates, while others starve. This ill-conditioning leads to “jagged” optimization paths.
4.2 The Solution: Spectral Normalization and Orthogonalized Updates
The Muon optimizer addresses this by modifying the update step itself, rather than the weight values directly. Muon treats the weight update as a geometric operation. It is designed specifically for 2D hidden parameters (weights).
Mechanism: Muon applies a standard momentum update and then projects this update matrix onto the nearest semi-orthogonal matrix using Newton-Schulz iteration.
Geometric Stability: Orthogonalization forces all singular values of the update matrix to 1, so the update has a spectral norm of exactly 1.
$$O^\top O = I \implies \sigma_i(O) = 1$$

Stability Benefit: This ensures isometry. The update step moves the weights with equal “force” in all dimensions of the eigenspace. This acts as a definitive cure for ill-conditioning, as it equalizes the effective learning rate across all latent directions, regardless of their original gradient magnitude.
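A sketch of the underlying mechanism: the classic cubic Newton-Schulz iteration drives all singular values of a (Frobenius-scaled) matrix toward 1. Muon itself uses a tuned higher-order polynomial variant; the textbook iteration below is only meant to illustrate the orthogonalization step.

```python
import numpy as np

def newton_schulz_orthogonalize(G, n_iters=30):
    """Approximate the nearest (semi-)orthogonal matrix to G, i.e. the U V^T
    factor of its polar decomposition, via the cubic Newton-Schulz iteration."""
    X = G / np.linalg.norm(G)                 # Frobenius scaling so the iteration converges
    for _ in range(n_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(8)
G = rng.standard_normal((64, 64))             # a raw (momentum) update matrix
O = newton_schulz_orthogonalize(G)

print("largest/smallest singular value of G:", np.linalg.svd(G, compute_uv=False)[[0, -1]])
print("largest/smallest singular value of O:", np.linalg.svd(O, compute_uv=False)[[0, -1]])  # both ~1
```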
5. Summary: The Unified Stability Framework
By connecting mHC and Muon, we can formulate a generalized stability protocol for deep learning architectures. The stability of a linear layer is maintained by applying specific constraints based on the matrix’s role:
| Role | Matrix | Action | Stability Requirement | Modern Solution |
|---|---|---|---|---|
| State | $X$ | Evolution | Normalization | RMSNorm / LayerNorm |
| Operator | $\mathcal{H}$ | Left-Mult (Mixing) | Conservation ($\sum_j H_{ij}=1$ / $\sum_i H_{ij}=1$) | mHC (Sinkhorn Projection) |
| Weight | $W$ | Right-Mult (Transform) | Isometry (Orthogonality / Spectral Norm) | Muon (Newton–Schulz) |
mHC creates a “safe container” for signal flow. By ensuring $\mathcal{H}^{res}$ is doubly stochastic, it guarantees that information is mixed without being created or destroyed.
Muon ensures “efficient navigation” within that container. By orthogonalizing updates, it ensures that the feature transformation $W$ evolves consistently across all dimensions, preventing the internal feature representation from collapsing into a lower-rank state.
Together, these methods represent a shift from heuristic stability (e.g., initialization tricks, normalization layers) to structural stability derived from rigorous linear algebra constraints.