Normalisation
Multi-linear operations, especially those applied to copies of the same input, are not Lipschitz continuous. Their growth is bounded not by linear functions but by polynomials. Specifically, the norm of the output of a bilinear layer can only be bounded by the square of the input norm.
\[ \exists c: ||Lx \odot Rx||_2 < c ||x||^2_2 \]
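One way to see where the constant comes from is the standard chain of inequalities \(||a \odot b||_2 \le ||a||_\infty\, ||b||_2 \le ||a||_2\, ||b||_2\), combined with the spectral norms of \(L\) and \(R\):
\[ ||Lx \odot Rx||_2 \le ||Lx||_2\, ||Rx||_2 \le ||L||_2\, ||R||_2\, ||x||_2^2, \]
so \(c = ||L||_2\, ||R||_2\) suffices.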
A back-of-the-envelope calculation shows the degree of this bound doubles with each layer: after stacking \(n\) such layers, the output norm is only bounded by a constant times \(||x||_2^{2^n}\). It should not be surprising that stacking even a few of them immediately leads to extremely unstable training. Note that the same holds for attention, which also multiplies activations with each other.
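As a quick sanity check of the \(2^n\) exponent, here is a minimal NumPy sketch (the dimensions and the random \(1/\sqrt{d}\) initialisation are illustrative assumptions): because each bilinear layer is homogeneous of degree two, doubling the input to an \(n\)-layer stack scales the output norm by exactly \(2^{2^n}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# A stack of random bilinear layers, each computing (L @ x) * (R @ x) element-wise.
layers = [(rng.normal(size=(d, d)) / np.sqrt(d),
           rng.normal(size=(d, d)) / np.sqrt(d)) for _ in range(4)]

def forward(x, n_layers):
    for L, R in layers[:n_layers]:
        x = (L @ x) * (R @ x)
    return x

x = rng.normal(size=d)
for n in range(1, 5):
    # Each layer is degree-2 homogeneous, so the n-layer stack is degree 2**n:
    # doubling the input scales the output norm by exactly 2**(2**n).
    ratio = np.linalg.norm(forward(2 * x, n)) / np.linalg.norm(forward(x, n))
    print(f"{n} layer(s): doubling the input scales the output norm by {ratio:,.0f} = 2^{2**n}")
```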
Prior normalisation techniques such as batch normalisation, which divide by statistics averaged over the batch, are not equipped to resolve this. Mathematically, batchnorm merely shrinks the constant \(c\) in the bound above; it does not change the degree of the polynomial. When the norms of individual samples vary too much, batchnorm fails to regularise them.
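A small sketch of this failure mode, assuming a toy batch normalisation that standardises each feature with batch statistics (learnable scale and shift omitted): when per-sample input norms span several orders of magnitude, the per-sample output norms of a bilinear layer still do.

```python
import numpy as np

rng = np.random.default_rng(0)
d, batch = 64, 256

# A batch whose per-sample norms vary over several orders of magnitude.
scales = 10.0 ** rng.uniform(-2, 2, size=(batch, 1))
X = scales * rng.normal(size=(batch, d))

def batchnorm(X, eps=1e-5):
    # Standardise each feature using statistics of the whole batch.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

L = rng.normal(size=(d, d)) / np.sqrt(d)
R = rng.normal(size=(d, d)) / np.sqrt(d)

H = batchnorm(X)
Y = (H @ L.T) * (H @ R.T)            # bilinear layer applied row-wise
norms = np.linalg.norm(Y, axis=1)
print(f"output norms after batchnorm: min {norms.min():.2e}, max {norms.max():.2e}")
# Batch statistics rescale every sample by the same per-feature factor, so an
# outlier sample stays an outlier and the quadratic blow-up persists.
```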
Hence, any architecture that extensively uses self-multiplication relies on instance-based normalisation. These norms generally discard the magnitude of each activation vector while retaining its direction. The most common variant simply divides each activation vector by its \(L_2\) norm.
\[ \exists c: \left|\left|\dfrac{Lx}{||x||_2} \odot \dfrac{Rx}{||x||_2}\right|\right|_2 < c \]
The result is a guaranteed bound on the output norm, restoring Lipschitz continuity. Note that only this normalisation exactly cancels the \(||x||_2^2\) factor in the earlier bound, which is probably the reason it works so well.
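A minimal sketch of the normalised layer, assuming the input is divided by its \(L_2\) norm before being passed to \(L\) and \(R\) (as in the equation above); the shapes and initialisation are again illustrative. The output norm stays below \(||L||_2\, ||R||_2\) no matter how the input is scaled.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

L = rng.normal(size=(d, d)) / np.sqrt(d)
R = rng.normal(size=(d, d)) / np.sqrt(d)

def normalised_bilinear(x, eps=1e-8):
    # Divide the input by its L2 norm before the bilinear layer, so the
    # ||x||^2 factor in the bound cancels and the output norm is bounded
    # by ||L||_2 * ||R||_2 regardless of the input scale.
    x = x / (np.linalg.norm(x) + eps)
    return (L @ x) * (R @ x)

bound = np.linalg.norm(L, 2) * np.linalg.norm(R, 2)   # product of spectral norms
for scale in [1e-3, 1.0, 1e3, 1e6]:
    x = scale * rng.normal(size=d)
    out = np.linalg.norm(normalised_bilinear(x))
    print(f"input scale {scale:>8.0e}: output norm {out:.3f}  (bound {bound:.3f})")
```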