Summary
While many feature-first interpretability methods exist, their formation and effect are not well understood. We argue that compositional layers, rooted in (self-)multiplication, can overcome this. Any such compositional layer is described by a tensor, potentially operating on copies of the input. For efficiency, these tensors are instantiated through structured connections, described by a network.
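As a minimal sketch of such a layer, consider the bilinear case: a third-order tensor contracted with two copies of the input. The `bilinear_layer` name and the sizes below are illustrative assumptions, not fixed by the text:

```python
import numpy as np

# Illustrative sizes (assumptions, not from the text).
d_in, d_out = 4, 3

# A bilinear layer is described by a third-order tensor B: each output
# coordinate is a quadratic form over two copies of the input x.
B = np.random.randn(d_out, d_in, d_in)

def bilinear_layer(x):
    # y_k = sum_{i,j} B[k, i, j] * x[i] * x[j]  (self-multiplication)
    return np.einsum("kij,i,j->k", B, x, x)

y = bilinear_layer(np.random.randn(d_in))  # shape (d_out,)
```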
Tensor networks define how higher-order tensors are structured, often in terms of simple matrices. These networks can be described diagrammatically, which makes them convenient to reason about: diagrams convey complex structure while avoiding overly specific mathematical notation.
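As a sketch of such a structured instantiation, assume a rank-r (CP-style) factorization of the bilinear tensor above into three matrices; this is one possible network among many, chosen here only for illustration:

```python
import numpy as np

d_in, d_out, rank = 4, 3, 8

# Three simple matrices are the nodes of the network; their shared
# index r is the internal edge connecting them.
W_out = np.random.randn(d_out, rank)
W_left = np.random.randn(rank, d_in)
W_right = np.random.randn(rank, d_in)

# Contracting the network recovers the dense third-order tensor ...
B = np.einsum("kr,ri,rj->kij", W_out, W_left, W_right)

def structured_layer(x):
    # ... but the layer can be evaluated without ever materializing B.
    return W_out @ ((W_left @ x) * (W_right @ x))

x = np.random.randn(d_in)
assert np.allclose(structured_layer(x), np.einsum("kij,i,j->k", B, x, x))
```

The factorization makes the efficiency claim concrete: the dense tensor holds d_out * d_in^2 parameters, while the three matrices hold only rank * (d_out + 2 * d_in).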