Introduction

Who is this book for?

This book is aimed primarily at (mechanistic) interpretability researchers. It serves as an informal tutorial on compositional interpretability.

Familiarity with the following topics is assumed.

  • Linear algebra, specifically matrix decompositions
  • Deep learning, specifically architecture design
  • Common interpretability techniques and challenges

While the book should remain accessible without this background, readers lacking it may need to do some googling along the way.

This book is in beta and may have a somewhat steep learning curve. For instance, a fair amount of notation is required to properly understand the methods used. We're actively looking for feedback on how to improve this.

What to expect from this book?

This book roughly consists of three parts.

  • Combining the advantages of tensor networks and neural networks
  • Leveraging compositionality in tensor networks toward interpretability
  • Experimenting on real-world models to find interesting mechanisms

The book is designed with sequential reading in mind (rather than as a lookup table).

We believe most topics provide a fresh perspective, even for readers familiar with the subject.

Some reading tips

This book natively supports different themes, selectable via the brush icon on the top-left.
The authors recommend trying the 'Rust' theme, as it reduces visual strain.