MLPs
MLPs are the bread and butter of any neural network. Their most basic form consists of an up-projection, a non-linearity and a down-projection. Recently, a variant called the gated linear unit (GLU) has become more popular due to its increased expressivity. GLUs contain an additional element-wise 'gating' operation, as shown below. Intuitively, this gate dynamically selects which information should be retained.
Mathematically, this corresponds to \( D (\sigma(Gx) \odot Ux) \), where \(\sigma\) is an activation function. Diagrammatically, this yields the following (pseudo-)network.
The matrices of a GLU are often called G(ate), U(p) and D(own)
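For reference, here is a minimal sketch of a GLU in PyTorch. The dimension names and the choice of SiLU for \(\sigma\) are illustrative assumptions; the formula above only fixes the structure.

```python
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Gated linear unit: computes D(sigma(G x) * U x)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # G
        self.up = nn.Linear(d_model, d_hidden, bias=False)    # U
        self.down = nn.Linear(d_hidden, d_model, bias=False)   # D
        self.act = nn.SiLU()  # sigma; an illustrative choice of activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gate applied to the up-projection, then project back down.
        return self.down(self.act(self.gate(x)) * self.up(x))
```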
Bilinear layers
Interestingly, experiments show that the activation function can be removed with little penalty. This yields a bilinear layer, which can be expressed as a tensor network. Consequently, we will use this variant extensively within our architecture design, starting with deep (residual) MLPs. Removing the non-linearity somewhat undermines the gating intuition, hence we rename the matrices: the layer computes \( D (Lx \odot Rx) \).
The matrices of a bilinear layer are denoted as L(eft), R(ight) and D(own)
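The corresponding module is the GLU sketch above with the activation dropped; a minimal sketch under the same illustrative assumptions:

```python
import torch
import torch.nn as nn

class Bilinear(nn.Module):
    """Bilinear layer: computes D(L x * R x), i.e. a GLU without the activation."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.left = nn.Linear(d_model, d_hidden, bias=False)   # L
        self.right = nn.Linear(d_model, d_hidden, bias=False)  # R
        self.down = nn.Linear(d_hidden, d_model, bias=False)   # D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No non-linearity: each output coordinate is quadratic in x.
        return self.down(self.left(x) * self.right(x))
```

Since each output is a quadratic form in the input, the whole layer is described by a single third-order tensor, which is what makes the tensor-network view possible.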
In fact, these layers have a deep history in the tensor literature under the name canonical polyadic decomposition (CPD).[^1] In some sense, the CPD can be seen as a generalisation of matrix rank to tensors: instead of summing outer products of two vectors, it sums outer products of three vectors.
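To make that correspondence concrete, the following sketch (building on the hypothetical `Bilinear` module above, with illustrative sizes) constructs the CPD tensor \( T_{oij} = \sum_h D_{oh} L_{hi} R_{hj} \), a sum over the hidden dimension of outer products of three vectors, and checks that contracting it with the input twice reproduces the layer's output:

```python
import torch

d_model, d_hidden = 4, 8
layer = Bilinear(d_model, d_hidden)  # the sketch from above
L, R, D = layer.left.weight, layer.right.weight, layer.down.weight

# CPD: T[o, i, j] = sum_h D[o, h] * L[h, i] * R[h, j]
T = torch.einsum('oh,hi,hj->oij', D, L, R)

x = torch.randn(d_model)
out_tensor = torch.einsum('oij,i,j->o', T, x, x)  # contract T with x twice
out_layer = layer(x)
assert torch.allclose(out_tensor, out_layer, atol=1e-5)
```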