Motivation
Why composition matters
Standard activation functions (such as ReLU, Sigmoid, etc.) are inherently non-compositional. Consider any set of input features that activates an output: ablating any one of them could ‘switch off’ the ReLU. Each feature’s effect depends on the others; subsets of the input cannot be considered separately, as doing so changes the output unrecoverably. Examining these activation functions therefore requires the whole input. Consequently, weights cannot be meaningfully decomposed and circuits cannot be understood in general.¹ In other words, non-compositional activation functions shroud the high-level relation between inputs, outputs and weights. This is why adequately describing feature interactions has remained an open problem in interpretability.
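As a minimal numeric sketch of this point (the weights and inputs below are illustrative, not taken from any model), the measured effect of a single feature on a ReLU unit depends on which other features accompany it:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# One ReLU unit with fixed weights; feature 2 has the same value (0.4) in both contexts.
w = np.array([1.0, -2.0, 1.5])

ctx_a = np.array([1.0, 0.2, 0.4])
ctx_b = np.array([0.3, 0.4, 0.4])

def contribution(x, idx):
    """Effect of feature idx, measured by ablating it (setting it to zero)."""
    x_ablated = x.copy()
    x_ablated[idx] = 0.0
    return relu(w @ x) - relu(w @ x_ablated)

print(contribution(ctx_a, 2))  # ~0.6: the full weighted value w[2] * 0.4 passes through
print(contribution(ctx_b, 2))  # ~0.1: the ReLU clips it, so the same feature contributes less
```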
Compositional activation functions
Fortunately, there exist compositional activation functions. Consider a linear layer followed by a squared activation function --- the simplest non-linear polynomial --- written as \( (A x)^2 \), where the square is applied element-wise. We denote a single output as \( (a^T x)^2 \), where \(x\) is the input and \(a\) is the corresponding row of \(A\). This can be rewritten as follows:
\[ (a^T x)^2 = (a^T x)^T (a^T x) = (x^T a) (a^T x) = x^T (a a^T) x = x^T Q x \]
In other words, the square can be seen as an element-wise product between duplicated activations. This can be rearranged into a single matrix \(Q\) (the outer product of \(a\) with itself) that defines how the output is created from a duplicated \(x\). Each entry in \(Q\) describes the interaction strength between a pair of input features. Notably, the output is precisely the sum of all weighted pairwise interactions. This means a given interaction is always present (scaled according to the input) and cannot be suppressed. A squared activation function’s feature composition is meaningful and can be described precisely.
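A quick numerical check of the identity above (a sketch; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a = rng.normal(size=d)   # one row of A, producing one output
x = rng.normal(size=d)   # input

Q = np.outer(a, a)       # Q[i, j] = a[i] * a[j]: interaction strength of features i and j

lhs = (a @ x) ** 2       # squared activation of the linear output
rhs = x @ Q @ x          # sum of all weighted pairwise interactions x[i] * Q[i, j] * x[j]
assert np.allclose(lhs, rhs)

# The output decomposes exactly into pairwise contributions:
contributions = np.outer(x, x) * Q
assert np.allclose(contributions.sum(), lhs)
```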
Formally, pairwise interactions form a linear basis for this activation function, which can be studied with well-known tools such as the SVD. Stacking the interaction matrices of all outputs yields a third-order tensor that describes the layer exactly and compactly. This clarifies why composition is helpful for interaction-based interpretability.
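A short sketch of this tensor view for a whole layer (shapes and names are assumptions for illustration): each output gets its own interaction matrix \(Q_k = a_k a_k^T\), and the stack of these matrices reproduces the layer exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
A = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Third-order interaction tensor: one interaction matrix Q_k = a_k a_k^T per output.
T = np.einsum('ki,kj->kij', A, A)          # shape (d_out, d_in, d_in)

# The tensor describes the layer exactly:
y_layer = (A @ x) ** 2
y_tensor = np.einsum('kij,i,j->k', T, x, x)
assert np.allclose(y_layer, y_tensor)

# Each slice is a symmetric matrix, so standard linear-algebra tools apply.
# For the squared activation each slice is rank one, which the SVD recovers directly:
u, s, vt = np.linalg.svd(T[0])
print(np.round(s, 6))   # one non-zero singular value, the rest ~0
```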
Generalising composition using tensors
The quadratic activation can be generalised by allowing asymmetric interactions \((x^T a) (b^T x)\).
This is called a bilinear layer and is an important building block for the remainder of this book.
Bilinear layers are compositional and perform on par with common MLP layers, even in large transformers.
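A minimal bilinear-layer sketch (the names and shapes below are assumptions, not a reference implementation from this book): each output is an element-wise product of two linear projections, \( (Wx) \odot (Vx) \), so output \(k\) equals \( (w_k^T x)(v_k^T x) = x^T (w_k v_k^T) x \), an asymmetric interaction matrix per output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Bilinear layer: element-wise product of two linear projections.
y = (W @ x) * (V @ x)

# Equivalent tensor view: output k is x^T (w_k v_k^T) x.
T = np.einsum('ki,kj->kij', W, V)
assert np.allclose(y, np.einsum('kij,i,j->k', T, x, x))
```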
This raises the question: "what makes a layer compositional, for our purposes?"
The answer is simple: "everything that is describable by a tensor."
The next section dives into what this means and how it can be applied to architectural design.
Caveats
All proposed architectures are designed to incur minimal modifications and retain maximal accuracy. Despite this effort, it is impossible to guarantee that these changes generalise across tasks. For instance, the transformer modifications have been tested up to GPT2-sized models and generally achieve loss parity. However, this may change at the largest scales or for specific tasks that weren't considered here.
One setting in which the proposed architectures perform worse is the low-data regime: when iterating over many epochs, our models saturate more quickly, while ReLU-based counterparts continue improving.
-
1. I am speaking in a theoretical sense. In practice, some parts of ReLU-based networks contain near-decomposable structure that can be approximated well. However, we want to be able to analyse the parts of networks that don't follow this pattern.